Abstract. We establish the existence of optimal scheduling strategies
for time-bounded reachability in continuous-time Markov decision pro- 
cesses, and of co-optimal strategies for continuous-time Markov games. 
Furthermore, we show that optimal control does not only exist, but has 
a surprisingly simple structure: The optimal schedulers from our proofs 
are deterministic and timed-positional, and the bounded time can be di- 
vided into a finite number of intervals, in which the optimal strategies are 
positional. That is, we demonstrate the existence of finite optimal con- 
trol. Finally, we show that these pleasant properties of Markov decision 
processes extend to the more general class of continuous-time Markov 
games, and that both early and late schedulers show this behaviour. 



1 Introduction 

Continuous-time Markov decision processes (CTMDPs) are a widely used frame- 
work for dependability analysis and for modelling the control of manufacturing 
processes [12,6], because they combine real-time aspects with probabilistic be- 
haviour and non-deterministic choices. CTMDPs can also be viewed as a frame- 
work that unifies different stochastic model types [14, 12, 9, 7, 10]. 

While CTMDPs allow for analysing worst-case and best-case scenarios, they 
fall short of the demands that arise in many real control problems, as they 
disregard the different nature that non-determinism can have depending on its 
source: Some sources of non-determinism are supportive, while others are hostile, 
and in a realistic control scenario, we face both types of non-determinism at the 
same time: Supportive non-determinism can be used to model the influence of 
a controller on the evolution of a system, while hostile non-determinism can 
capture abstraction or unknown environments. We therefore consider a natural 
extension of CTMDPs: Continuous-time Markov games (CTMGs) that have two 
players with opposing objectives [5]. 
Fig. 1. A CTMDP and the reachability probabilities for all positional schedulers
with time bound $t_{\max} = 1$. Time $t' = t_{\max} - [\ldots] \log(2)$ is the
optimal time to switch to action b.

The analysis of CTMDPs and CTMGs requires resolving the non-deterministic
choices by means of a scheduler (which consists of a pair of strategies in the
case of CTMGs), and typically aims to optimise a given objective function.

In this paper, we study the time-bounded reachability problem, which recently
enjoyed much attention [3, 16, 10, 11, 5]. Time-bounded reachability in CTMDPs 
is the standard control problem to construct a scheduler that controls the Markov 
decision process such that the probability of reaching a goal region within a 
given time bound is maximised (or minimised), and to determine the value. For 
CTMGs, time-bounded reachability reduces to finding a Nash equilibrium, that 
is, a pair of strategies for the players, such that each strategy is optimal for the 
chosen strategy of her opponent. 

While continuous-time Markov games are a young field of research [5,13], 
Markov decision processes have been studied for decades [8,4]. 

Optimal control in CTMDPs clearly depends on the observational power we 
allow our schedulers to have when observing a run. In the literature, various 
classes of schedulers with different restrictions on what they can observe [15, 10, 
5, 13] are considered. We focus on the most general class of schedulers, schedulers 
that can fully observe the system state and may change their decisions at any 
point in time (late schedulers, cf. [4, 10]). To be able to translate our results to
the more widespread class of schedulers that fix their decisions when entering 
a location (early schedulers), we introduce discrete locations that allow for a 
translation from early to late schedulers (see Appendix D). 

Due to their practical importance, time-bounded reachability for continuous- 
time Markov models has been studied intensively [5,4,11,3,10,1,2,16]. How- 
ever, most previous research focussed on approximating optimal control. (The 
existence of optimal control is currently only known for the artificial class of 
time-abstract schedulers [5,13], which assume that the scheduler has no access 
whatsoever to a clock.) While an efficient approximation is of interest to a prac- 
titioner, being unable to determine whether or not optimal control exists is very 
dissatisfying from a scientific point of view. 
Contributions. This paper has three main contributions: First, we extend the
common model of CTMDPs by adding discrete locations, which are passed in
0 time. This generalisation of the model is mainly motivated by avoiding the
discussion about the appropriate scheduler class. In particular, the widespread 
class of schedulers that fix their actions when entering a location can be encoded 
by a simple mapping. 

The second contribution of this paper is the answer to an intriguing research 
question that remained unresolved for half a century: We show that optimal 
control of CTMDPs exists for time-bounded reachability and safety objectives. 
Moreover, we show that optimal control can always be finite. 

Our third contribution is to lift these results to continuous-time Markov 
games. 

Pursuing a different research question, we exploit proof techniques that differ 
from those frequently used in the analysis of CTMDPs. Our proofs build mainly 
on topological arguments: The proof that demonstrates the existence of measur- 
able optimal schedulers, for example, shows that we can fix the decisions of an 
optimal scheduler successively on closures of open sets (yielding only measurable 
sets), and the lift to finiteness uses local optimality of positional schedulers in 
open left and right environments of arbitrary points of times and the compact- 
ness of the bounded time interval. 

Structure of the Paper. We follow a slightly unorthodox order of proofs for a
mathematical paper: we start with a special case in Section 3 and generalise the
results later. Besides keeping the proofs simple, this approach is chosen because 
the simplest case, CTMDPs, is the classical case, and we assume that a wider 
audience is interested in results for these structures. In the following section, 
we strengthen this result by demonstrating that optimal control does not only 
exist, but can be found among schedulers with finitely many switching points 
and positional strategies between them. In Section 5, we lift this result to single 
player games (thus extending it to other scheduler classes like those which fix 
their decision when entering a location, cf. Appendix D). In the final section, we 
generalise the existence theorem for finite optimal control to finite co-optimal 
strategies for general continuous-time Markov games. 

2 Preliminaries 

A continuous-time Markov game is a tuple
$(L, L_d, L_c, L_r, L_s, G, Act, R, P, \nu)$, consisting of

— a finite set $L$ of locations, which is partitioned into

• a set $L_d$ of discrete locations and a set $L_c$ of continuous locations, and

• sets $L_r$ and $L_s$ of locations owned by a reachability and a safety player,

— a dedicated set $G \subseteq L$ of goal locations,

— a finite set $Act$ of actions,

— a rate matrix $R : (L_c \times Act \times L) \to \mathbb{Q}_{\ge 0}$,

— a discrete transition matrix $P : (L \times Act \times L) \to \mathbb{Q}_{\ge 0} \cap [0, 1]$, and

— an initial distribution $\nu \in Dist(L)$,



that satisfies the following side-conditions: For all continuous locations
$l \in L_c$, there must be an action $a \in Act$ such that
$R(l, a, L) := \sum_{l' \in L} R(l, a, l') > 0$; we call such actions enabled.
For actions enabled in continuous locations, we require
$P(l, a, l') = R(l, a, l') / R(l, a, L)$, and we require $P(l, a, l') = 0$ for
the remaining actions.

For discrete locations, we require that either $P(l, a, l') = 0$ holds for all
$l' \in L$, or that $\sum_{l' \in L} P(l, a, l') = 1$ holds true. Like in the
continuous case, we call the latter actions enabled and require the existence
of at least one enabled action for each discrete location $l \in L_d$.

The idea behind discrete-time locations is that they execute immediately. We 
therefore do not permit cycles of only discrete-time locations (counting every 
positive rate of any action as a transition). This restriction is stronger than it 
needs to be, but it simplifies our proofs, and the simpler model is sufficient for 
our means. 
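
The tuple and its side-conditions translate directly into a small data structure. The following Python sketch is only an illustration: the class name `CTMG`, its field names, and the example rates are assumptions made here and do not come from the paper, and the check that forbids cycles of discrete locations is omitted.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Dict, FrozenSet, Set, Tuple

Key = Tuple[str, str, str]  # (location, action, successor location)


@dataclass
class CTMG:
    """Illustrative container for (L_d, L_c, L_r, L_s, G, Act, R, P)."""
    discrete: FrozenSet[str]
    continuous: FrozenSet[str]
    reach_owned: FrozenSet[str]
    safety_owned: FrozenSet[str]
    goal: FrozenSet[str]
    actions: FrozenSet[str]
    rates: Dict[Key, Fraction]  # R, given for continuous source locations
    probs: Dict[Key, Fraction]  # P, given for discrete source locations

    def exit_rate(self, loc: str, act: str) -> Fraction:
        """R(l, a, L): sum of the rates of action a in location l."""
        return sum((r for (l, a, _), r in self.rates.items()
                    if l == loc and a == act), Fraction(0))

    def prob_mass(self, loc: str, act: str) -> Fraction:
        """Sum over l' of P(l, a, l') for a discrete location l."""
        return sum((p for (l, a, _), p in self.probs.items()
                    if l == loc and a == act), Fraction(0))

    def enabled(self, loc: str) -> Set[str]:
        """Enabled actions: positive exit rate, or probability mass 1."""
        if loc in self.continuous:
            return {a for a in self.actions if self.exit_rate(loc, a) > 0}
        return {a for a in self.actions if self.prob_mass(loc, a) == 1}

    def check(self) -> bool:
        """Check the partitions and the existence of enabled actions."""
        locs = self.discrete | self.continuous
        ok = not (self.discrete & self.continuous)
        ok &= (self.reach_owned | self.safety_owned) == locs
        ok &= not (self.reach_owned & self.safety_owned)
        ok &= self.goal <= locs
        ok &= all(self.enabled(l) for l in locs)
        ok &= all(self.prob_mass(l, a) in (0, 1)
                  for l in self.discrete for a in self.actions)
        return ok


# Hypothetical single-player example: a CTMDP with one real choice in "l".
ALL = frozenset({"l", "m", "G", "B"})
example = CTMG(
    discrete=frozenset(), continuous=ALL,
    reach_owned=ALL, safety_owned=frozenset(),
    goal=frozenset({"G"}), actions=frozenset({"a", "b", "c"}),
    rates={("l", "a", "m"): Fraction(2),
           ("l", "b", "G"): Fraction(1), ("l", "b", "B"): Fraction(1),
           ("m", "c", "G"): Fraction(2),
           ("G", "c", "G"): Fraction(1), ("B", "c", "B"): Fraction(1)},
    probs={},
)
```

Exact rationals (`Fraction`) are used because the definition takes rates and probabilities from the rationals; floating-point numbers would work equally well for experiments.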

We assume that the goal region is absorbing, that is, $P(l, a, l') = 0$ holds
for all $l \in G$ and $l' \notin G$. See Section 7 for the extension to
non-absorbing goal regions.

Intuitively, it is the objective of the reachability player to maximise the
probability to reach the goal region in a predefined time $t_{\max}$, while it
is the objective of the safety player to minimise this probability. (Hence, it
is a zero-sum game.)

We are particularly interested in (traditional) CTMDPs. They are single
player CTMGs, where either all positions belong to the reachability player
($L = L_r$), or to the safety player ($L = L_s$), without discrete locations
($L_d = \emptyset$ and $L_c = L$).

Paths. A timed path $\pi$ in a CTMG $\mathcal{M}$ is a finite sequence in
$L \times (Act \times \mathbb{R}_{\ge 0} \times L)^* = Paths(\mathcal{M})$. We write

\[ l_0 \xrightarrow{a_0, t_0} l_1 \xrightarrow{a_1, t_1} \cdots \xrightarrow{a_{n-1}, t_{n-1}} l_n \]

for a sequence $\pi$, and we require $0 \le t_{i-1} \le t_i \le t_{\max}$ for all
$i < n$, where $t_{\max}$ is the time bound for our time-bounded reachability
probability. (We are not interested in the behaviour of the system after
$t_{\max}$.) The $t_i$ denote the system's time when the action $a_i$ is selected
and a discrete transition from $l_i$ to $l_{i+1}$ takes place. Concatenation of
paths $\pi$, $\pi'$ will be written as $\pi \circ \pi'$ if the last location of
$\pi$ is the first location of $\pi'$ and the points of time are ordered
correctly. We call a timed path a complete timed path when we want to stress
that this path describes a complete system run, not to be extended by further
transitions.

Schedulers and Strategies. The nondeterminism in the system needs to be
resolved by a scheduler which maps paths to decisions. The power of schedulers
is determined by their ability to observe and distinguish paths, and thus by their
domain. In this paper, we consider the following common scheduler classes:

— Timed history-dependent (TH) schedulers $Paths(\mathcal{M}) \times \mathbb{R}_{\ge 0} \to D$
that map timed paths and the remaining time to decisions.

— Timed positional (TP) schedulers $L \times \mathbb{R}_{\ge 0} \to D$
that map locations and the remaining time to decisions.

— Positional (P) or memoryless schedulers $L \to D$
that map locations to decisions.

Decisions $D$ are either randomised (R), in which case $D = Dist(Act)$ is the
set of distributions over enabled actions, or are restricted to deterministic (D)
choices, that is, $D = Act$. Where it is necessary to distinguish randomised and
deterministic versions, we add a postfix to the scheduler class, for example
THD and THR.

Strategies. In case of CTMGs, a scheduler consists of the two participating
players' strategies, which can be seen as functions
$Paths(\mathcal{M}_p) \times \mathbb{R}_{\ge 0} \to D$, where
$Paths(\mathcal{M}_p)$ denotes, for $p \in \{r, s\}$, the paths ending on the
position of the reachability or safety player, respectively. As for general
schedulers, we can introduce restrictions on what players are able to observe.

Discrete locations. The main motivation to introduce discrete locations was
to avoid the discussion whether a scheduler has to fix its decision, as to which
action it chooses, upon entering a location, or whether such a decision can be
revoked while staying in the location. For example, the general measurable
schedulers discussed in [15] have only indirect access to the remaining time
(through the timed path), and therefore have to decide upon entrance of a
location which action they want to perform. Our definition builds on fully-timed
schedulers (cf. [4]) that were recently rediscovered and formalised by
Neuhäußer et al. [10], which may revoke their decision after they enter a
location. (As a side result, we lift Neuhäußer's restriction to local
uniformity.) The discrete locations now allow us to encode making the decision
upon entering a continuous location $l$ by mapping the decision to a discrete
location that is 'guarding the entry' to a family of continuous locations, one
for each action enabled in $l$. (See Appendix D for details.)

Cylindrical Schedulers. While it is common to refer to TH schedulers as
a class, the truth is that there is no straightforward way to define a measure
for the time-bounded reachability probability for the complete class (cf. [15]). 
We therefore turn to a natural subset that can be used as a building block for a 
powerful yet measurable sub-class of TH schedulers, which is based on cylindrical 
abstractions of paths. 

Let $\mathcal{J}$ be a finite partition of the interval $[0, t_{\max}]$ into
intervals $I_0 = [0, t_0]$ and $I_i = (t_{i-1}, t_i]$ for $i = 1, \ldots, n$ with
$t_0 \ge 0$ and $t_i > t_{i-1}$ for $i = 1, \ldots, n$, where $t_n = t_{\max}$ is
the time-bound from the problem definition. Then we denote with $[t]_{\mathcal{J}}$
the interval $I_i \in \mathcal{J}$ that contains $t$, called the
$\mathcal{J}$-cylindrification of $t$, and we denote with

\[ [\pi]_{\mathcal{J}} = l_0 \xrightarrow{a_0, [t_0]_{\mathcal{J}}} l_1 \xrightarrow{a_1, [t_1]_{\mathcal{J}}} \cdots \xrightarrow{a_{n-1}, [t_{n-1}]_{\mathcal{J}}} l_n \]

the $\mathcal{J}$-cylindrification of the timed path
$\pi = l_0 \xrightarrow{a_0, t_0} l_1 \xrightarrow{a_1, t_1} \cdots \xrightarrow{a_{n-1}, t_{n-1}} l_n$.

We call a TH scheduler $\mathcal{J}$-cylindrical if its decisions depend only on
the cylindrification $[\pi]_{\mathcal{J}}$ and $[t]_{\mathcal{J}}$ of $\pi$ and
$t$, respectively, and cylindrical if it is $\mathcal{J}$-cylindrical for some
finite partition $\mathcal{J}$ of the interval $I = [0, t_{\max}]$.
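
Cylindrification amounts to replacing every time stamp by the index of the interval that contains it. The following Python sketch illustrates this under the assumption that the intervals are $[0, t_0]$ and $(t_{i-1}, t_i]$; the function names and the tuple encoding of timed paths are hypothetical choices made for this illustration.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple

Step = Tuple[str, float, str]           # (action, time, successor location)
TimedPath = Tuple[str, Sequence[Step]]  # (initial location, steps)


def cyl_index(bounds: List[float], t: float) -> int:
    """Index i of the interval containing t, for bounds = [t_0, ..., t_n]."""
    if not 0.0 <= t <= bounds[-1]:
        raise ValueError("t lies outside [0, t_max]")
    # First index i with t <= t_i, i.e. t in [0, t_0] or t in (t_{i-1}, t_i].
    return bisect_left(bounds, t)


def cylindrify(bounds: List[float], path: TimedPath):
    """Replace each time stamp of a timed path by its interval index."""
    start, steps = path
    return start, tuple((a, cyl_index(bounds, t), succ) for a, t, succ in steps)


bounds = [0.25, 0.5, 1.0]  # t_0, t_1, t_2 = t_max
p1 = ("l", [("a", 0.30, "m"), ("c", 0.90, "G")])
p2 = ("l", [("a", 0.45, "m"), ("c", 0.60, "G")])
p3 = ("l", [("a", 0.20, "m"), ("c", 0.60, "G")])
```

Here `p1` and `p2` have the same cylindrification and therefore cannot be distinguished by a scheduler that is cylindrical for this partition, whereas `p3` differs in its first interval.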
Cylindrical Sets and Probability Space. For a given finite partition
$\mathcal{J}$ of the interval $[0, t_{\max}]$, a $\mathcal{J}$-cylindrical set of
timed paths is the set of timed paths with the same
$\mathcal{J}$-cylindrification, and we call a finite partition $\mathcal{J}'$ of
$[0, t_{\max}]$ a refinement of $\mathcal{J}$ if every interval in $\mathcal{J}$
is the union of intervals in $\mathcal{J}'$.

For a $\mathcal{J}$-cylindrical scheduler $S$ and a $\mathcal{J}'$-cylindrical
set of finite timed paths, where $\mathcal{J}'$ is a refinement¹ of
$\mathcal{J}$, the likelihood that a complete path is from this cylindrical set
is easy to define: Within each interval of $\mathcal{J}'$, the likelihood that a
CTMDP $\mathcal{M}$ with scheduler $S$ behaves in accordance with the
$\mathcal{J}'$-cylindrical set can, assuming compliance in all previous
intervals, be checked like for a finite Markov chain.

The probability $p_i$ to comply with the $i$-th segment of the partition
$\mathcal{J}'$ of $[0, t_{\max}]$ is the product of three multiplicands
($p_i = p_1^i \cdot p_2^i \cdot p_3^i$):

1. the probability $p_1^i$ that the actions are chosen in accordance with the
$\mathcal{J}'$-cylindrical set of timed paths (which is either 0 or 1 for
deterministic schedulers, and the product of the likelihood of the individual
decisions for randomised schedulers),

2. the probability $p_2^i$ that the transitions are taken in accordance with the
$\mathcal{J}'$-cylindrical set of timed paths, provided the respective actions
are chosen, which is simply the product over the individual probabilities
$P(l_j, a_j, l_{j+1})$ in this sequence of the $\mathcal{J}'$-cylindrical set of
timed paths, and

3. the probability $p_3^i$ that the right number of steps is made in this
sequence of the $\mathcal{J}'$-cylindrical set of timed paths.

The latter probability $p_3^i$ is 0 if the last location is a discrete location,
as the system would leave this location at the same point in time in which it
was entered. Otherwise, it is the difference
$p_3^i = p_{\ge n}^i - p_{\ge n+1}^i$ between the likelihood that at least the
correct number of $n \ge 0$ transitions starting in continuous locations are
made ($p_{\ge n}^i$), and the likelihood that at least $n + 1$ transitions
starting in continuous locations are made ($p_{\ge n+1}^i$) in the relevant
sequence of the timed path.

Let $l_0, l_1, \ldots, l_n$ be the n continuous locations (named in the required
order of appearance; note that $n$ might be 0, and that the same location can
occur multiple times), and let $\lambda_0, \lambda_1, \ldots, \lambda_n$ be the
transition rate one would observe at the respective $l_j$. For deterministic
schedulers, this transition rate is simply $\lambda_j = R(l_j, a_j, L)$, where
$a_j$ is the decision from $S$ at the respective position in a timed path and in
$I_i$. For a randomised scheduler $S$, it is the respective expected transition
rate $\lambda_j = \sum_{a \in Act(l_j)} p_a R(l_j, a, L)$, where $p_a$ is the
likelihood that $S$ makes the decision $a$ at the respective position in a timed
path and in $I_i$. Note that the locations and transition rates are fixed.

The likelihood to get a path of length $\ge n$ is then

\[ \int_{(\tau_0, \ldots, \tau_{n-1}) \in \Phi_{n,i}} \prod_{k=0}^{n-1} \lambda_k e^{-\lambda_k \tau_k} \, d\tau_k
\quad \text{for} \quad
\Phi_{n,i} = \Bigl\{ (\tau_0, \ldots, \tau_{n-1}) \in [0, t_{\max}]^n \Bigm| \sum_{j=0}^{n-1} \tau_j \le t_i - t_{i-1} \Bigr\} \]

for $n > 0$, and 1 for $n = 0$. Likewise, the likelihood to get a path of length
$\ge n + 1$ is

\[ \int_{(\tau_0, \ldots, \tau_n) \in \Phi_{n+1,i}} \prod_{k=0}^{n} \lambda_k e^{-\lambda_k \tau_k} \, d\tau_k
\quad \text{for} \quad
\Phi_{n+1,i} = \Bigl\{ (\tau_0, \ldots, \tau_n) \in [0, t_{\max}]^{n+1} \Bigm| \sum_{j=0}^{n} \tau_j \le t_i - t_{i-1} \Bigr\}. \]

(Recall that $t_i$ and $t_{i-1}$ are the upper and lower endpoints of the
interval $I_i$.)

The likelihood that a complete timed path is in the $\mathcal{J}'$-cylindrical
set of timed paths for $S$ is the product $\prod_{I_i \in \mathcal{J}'} p_i$
over the individual $p_i$.

¹ The restriction to partitions $\mathcal{J}'$ that refine $\mathcal{J}$ is
purely technical, because for arbitrary $\mathcal{J}'$ we can simply use a
partition $\mathcal{J}''$ that refines both $\mathcal{J}$ and $\mathcal{J}'$,
and reconstruct every $\mathcal{J}'$-cylindrical set as a finite union of
$\mathcal{J}''$-cylindrical sets.
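
As a sanity check of these integrals, the smallest cases can be evaluated by hand. The abbreviation $\Delta = t_i - t_{i-1}$ for the length of the segment and the special case of one common rate $\lambda$ are assumptions introduced only for this illustration.

```latex
% Case n = 0: at least 0 transitions occur with likelihood 1, and
% at least 1 transition occurs with likelihood
\int_{0}^{\Delta} \lambda_0 e^{-\lambda_0 \tau_0} \, d\tau_0 = 1 - e^{-\lambda_0 \Delta},
% so the system makes exactly 0 transitions with likelihood
1 - \bigl(1 - e^{-\lambda_0 \Delta}\bigr) = e^{-\lambda_0 \Delta}.

% Special case of equal rates \lambda_0 = \dots = \lambda_n = \lambda:
% the sum of n exponential sojourn times is Erlang distributed, hence
\Pr[\text{length} \ge n] = 1 - \sum_{k=0}^{n-1} e^{-\lambda \Delta} \frac{(\lambda \Delta)^k}{k!},
\qquad
\Pr[\text{length} \ge n] - \Pr[\text{length} \ge n+1] = e^{-\lambda \Delta} \frac{(\lambda \Delta)^n}{n!}.
```

In the equal-rate case the difference is the Poisson probability of observing exactly $n$ transitions within time $\Delta$, which matches the intuition behind the third multiplicand.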

Probability Space. Having defined a measure for the likelihood that, for a 
given cylindrical scheduler, a complete timed path is in a particular cylindrical 
set, we define the likelihood that it is in a finite union of disjoint sets of cylindrical 
paths as the sum over the likelihood for the individual cylindrical sets. 

This primitive probability measure for primitive schedulers can be lifted in
two steps by a standard space completion, going from these primitive measures
to quotient classes of Cauchy sequences of such measures:

1. In a first step, we complete the space of measurable sets of complete timed
paths from finite unions of cylindrical sets of timed paths to Cauchy se- 
quences of finite unions of cylindrical sets of timed paths. We define the 
required difference measure of two sets of timed paths (each a finite disjoint 
union of cylindrical sets) as the measure of the symmetrical difference of 
the two sets. This set can obviously be represented as a finite disjoint union 
of cylindrical sets of timed paths, and we can use our primitive measure to 
define this difference measure. 

2. Having lifted the measure to this completed space of paths, we lift the set of 
measurable schedulers in a second step from cylindrical schedulers to Cauchy 
sequences of cylindrical schedulers. (The difference measure between two 
cylindrical schedulers is the likelihood that two schedulers act observably 
different.) 

More details of these standard constructions can be found in Appendix A.2.

Time-Bounded Reachability Probability. For a given CTMG
$\mathcal{M} = (L, L_d, L_c, L_r, L_s, G, Act, R, P, \nu)$ and a given
measurable scheduler $S$ that resolves the non-determinism, we use the
following notations for the probabilities:

— $Pr_S^{\mathcal{M}}(l, t)$ is the probability of reaching the goal region $G$
within time $t$ when starting in location $l$,

— $Pr_S^{\mathcal{M}}(t) = \sum_{l \in L} \nu(l) \, Pr_S^{\mathcal{M}}(l, t)$
denotes the probability of reaching the goal region $G$ within time $t$.

As usual, the supremum of the time-bounded reachability probability over a
particular scheduler class is called the time-bounded reachability of
$\mathcal{M}$ for this scheduler class.

3 Optimal Scheduling in CTMDPs 

In this section, we demonstrate the existence of optimal schedulers in traditional 
CTMDPs. Before turning to the proof, let us first consider what happens if time 
runs out, that is, at time $t_{\max}$, and then develop an intuition what an
optimal scheduling policy should look like.

If we are still in time ($t \le t_{\max}$) and we are in a goal location, then
we reach a goal location in time with probability 1; and if time has run out
($t = t_{\max}$) and we are not in a goal location, then we reach a goal
location in time with probability 0. For ease of notation, we also fix the
probability of reaching the goal location in time to 0 for all points in time
strictly after $t_{\max}$. For a measurable TPR scheduler $S$, we would get:

— $Pr_S^{\mathcal{M}}(l, t) = 1$ holds for all goal locations $l \in G$ and all $t \le t_{\max}$,

— $Pr_S^{\mathcal{M}}(l, t_{\max}) = 0$ holds for all non-goal locations $l \notin G$, and

— $Pr_S^{\mathcal{M}}(l, t) = 0$ holds for all locations $l \in L$ and all $t > t_{\max}$.

A scheduler $S$ can, in every point in time, choose from distributions over
successor locations. Such a choice should be optimal if the expected gain in the
probability of reaching the goal location is maximised.

This gain has two aspects: first, the probability of reaching the goal location
provided a transition is taken, and second, the likelihood of taking a
transition. Both are multiplicands in the defining differential equations,
assuming a cylindrical TPD scheduler $S$:

\[ -\dot{Pr}_S^{\mathcal{M}}(l, t) = \sum_{l' \in L} R(l, S(l, t), l') \cdot \bigl( Pr_S^{\mathcal{M}}(l', t) - Pr_S^{\mathcal{M}}(l, t) \bigr). \]

(We skip the simple generalisation to measurable TPD schedulers because it is
not required in the following proofs.)

The reachability probability of any scheduler is therefore intuitively
dominated by the function $f_{\max}$ and dominates the function $f_{\min}$
defined by the following equations:

\[ -\dot{f}_{opt}(l, t) = \operatorname*{opt}_{a \in Act(l)} \sum_{l' \in L} R(l, a, l') \cdot \bigl( f_{opt}(l', t) - f_{opt}(l, t) \bigr) \quad \text{for } t \in [0, t_{\max}], \]

where $opt \in \{\min, \max\}$. This intuitive result is not hard to prove².
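
To get a feeling for these equations, they can be integrated numerically backwards from the time bound. The following Python sketch uses explicit Euler steps, which only approximate $f_{opt}$; the example CTMDP, the names `RATES`, `GOAL` and `solve_f_opt`, and the step count are assumptions made for this illustration and are not taken from the paper.

```python
from typing import Callable, Dict

# Hypothetical CTMDP: RATES[location][action][successor] = rate.
RATES: Dict[str, Dict[str, Dict[str, float]]] = {
    "l": {"a": {"m": 2.0}, "b": {"G": 1.0, "B": 1.0}},
    "m": {"c": {"G": 2.0}},
    "G": {"c": {"G": 1.0}},  # absorbing goal location
    "B": {"c": {"B": 1.0}},  # absorbing non-goal location
}
GOAL = {"G"}


def solve_f_opt(opt: Callable, t_max: float = 1.0, steps: int = 20000) -> Dict[str, float]:
    """Approximate f_opt(., 0) by explicit Euler steps backwards from t_max."""
    h = t_max / steps
    # Boundary values at t_max: 1 in goal locations, 0 elsewhere.
    f = {loc: (1.0 if loc in GOAL else 0.0) for loc in RATES}
    for _ in range(steps):
        # Right-hand side of the equation: opt over the gains of all actions.
        gain = {
            loc: opt(
                sum(rate * (f[succ] - f[loc]) for succ, rate in succs.items())
                for succs in RATES[loc].values()
            )
            for loc in RATES
        }
        # -df/dt = gain, so stepping from t to t - h adds h * gain.
        f = {loc: f[loc] + h * gain[loc] for loc in RATES}
    return f


f_max = solve_f_opt(max)
f_min = solve_f_opt(min)
```

For this hypothetical example, `f_max["l"]` comes out at roughly 0.62 and `f_min["l"]` at roughly 0.41, while always choosing `a` or always choosing `b` yields values in between. Such a computation only approximates the functions; the existence results of this section concern schedulers that attain them exactly.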

Lemma 1. The reachability probability of any measurable THR scheduler is
dominated by the function $f_{\max}$ and dominates the function $f_{\min}$.

Proof Idea: To prove this claim for $f_{\max}$, assume that there is a scheduler
$S$ that provides a better time-bounded reachability probability
$Pr_S^{\mathcal{M}}(l, t) > f_{\max}(l, t)$ for some location $l \in L$ and time
$t \in [0, t_{\max}]$ (in particular for $t = 0$), and hence improves over
$f_{\max}(l, t)$ at this position by at least $3\varepsilon$ for some
$\varepsilon > 0$.

$S$ is a Cauchy sequence of cylindrical schedulers. Therefore we can sacrifice
one $\varepsilon$ and get an $\varepsilon$-close cylindrical scheduler from this
sequence, which is still at least $2\varepsilon$ better than $f_{\max}$ at
position $(l, t)$.

As the measure for this cylindrical scheduler is a Cauchy sequence of measures
for sequences with a bounded number of discrete transitions, we can sacrifice
another $\varepsilon$ to sharpen the requirement for the scheduler to reach the
goal region in time and with at most $k$ steps for an appropriate bound
$k \in \mathbb{N}$, still maintaining an $\varepsilon$ advantage over
$f_{\max}$. Hence, we can compare with a finite structure, and use an inductive
argument to show for paths $\pi$ of shrinking length that end in any location
$l' \in L$ that $f_{\max}(l', t) \ge Pr_S^{\mathcal{M}}(\pi, t)$ holds true,
which contradicts the assumed advantage. □

The full proof is moved to Appendix B.2.

² The systems of non-linear ordinary differential equations used in this paper
are all quite obvious, and the challenge is to prove that they can be taken and
not merely approximated. An approximative argument for these ODEs goes back to
Bellman [4], but he uses a less powerful set of schedulers, and only proves that
$f_{\max}$ and $f_{\min}$ can be approximated from below and above,
respectively, claiming that the other direction is obvious. After starting with
a similar claim, we were urged to include a full proof.

Theorem 1. For a CTMDP, there is a measurable TPD scheduler $S$ optimal for
maximum time-bounded reachability in the class of measurable THR schedulers.

Proof. We construct a measurable scheduler $S$ that always chooses an action $a$
that maximises
$\sum_{l' \in L} R(l, a, l') \cdot \bigl( Pr_S^{\mathcal{M}}(l', t) - Pr_S^{\mathcal{M}}(l, t) \bigr)$;
by Lemma 1, this guarantees that

\[ \sum_{l \in L} \nu(l) \, Pr_S^{\mathcal{M}}(l, 0) = \sum_{l \in L} \nu(l) \, f_{\max}(l, 0) = \sup_{S' \in THR} Pr_{S'}^{\mathcal{M}}(t_{\max}) \]

holds true.

To construct the scheduler decisions for a location $l$ for a measurable
scheduler $S$, we partition $[0, t_{\max}]$ into measurable sets
$\{D_a \mid a \in Act(l)\}$, such that $S$ only makes decisions that maximise
$\sum_{l' \in L} R(l, a, l') \cdot \bigl( Pr_S^{\mathcal{M}}(l', t) - Pr_S^{\mathcal{M}}(l, t) \bigr)$.
(For positions outside of $[0, t_{\max}]$, the behaviour of the scheduler does
not matter. $S(l, t)$ can therefore be fixed to any constant decision
$a \in Act(l)$ for all $l \in L$ and $t \notin [0, t_{\max}]$.)

We start with fixing an arbitrary order $\succ$ on the actions in $Act(l)$ and
introduce, for each point $t \in [0, t_{\max}]$, an order $\sqsupseteq_t$ on the
actions determined by the value of
$\sum_{l' \in L} R(l, a, l') \cdot \bigl( f_{\max}(l', t) - f_{\max}(l, t) \bigr)$,
using $\succ$ as a tie-breaker.

Along the order of $\succ$ we construct, starting with the minimal element, for
each action $a \in Act(l)$:

1. Open sets $O_a$ that contain the positions where our scheduler does not make
a decision³ $a' \prec a$.
(We choose $O_a = [0, t_{\max}]$ for the action $a$ that is minimal with respect
to $\succ$.)

2. A set $T_a$ that is open in $O_a$ and contains the points in time in $O_a$
where $a$ is maximal with respect to $\sqsupseteq_t$.

3. A set $D_a = \overline{T_a} \cap O_a$ which is the closure of $T_a$ in $O_a$.

If $a$ is not maximal, we set $O_{a'} = O_a \setminus D_a$ for the successor
$a'$ of $a$ with respect to $\succ$.

³ Note that, for all $a' \prec a$, the points $T_{a'}$ in time where the
scheduler does make the decision $a'$ have been fixed earlier by this
construction.
To complete the proof, we have to show that the scheduler $S$, which chooses $a$
for all $t \in D_a$, makes only decisions that maximise the gain, that is
$\sum_{l' \in L} R(l, a, l') \cdot \bigl( f_{\max}(l', t) - f_{\max}(l, t) \bigr)$
(and hence that $f_{\max} = Pr_S^{\mathcal{M}}$ holds true), and we have to show
that the resulting scheduler is measurable. As an important lemma on the way, we
have to demonstrate the claimed openness of the $O_a$'s and $T_a$'s in the
compact Euclidean space $[0, t_{\max}]$.

This openness is provided by a simple inductive argument: First, the complete
space $[0, t_{\max}]$ is open in itself.

Let us assume that $a$ is maximal w.r.t. $\sqsupseteq_t$ for a $t$ in the open
set $O_a$. Then the following holds:
$\sum_{l' \in L} R(l, a, l') \cdot \bigl( f_{\max}(l', t) - f_{\max}(l, t) \bigr)$
is strictly greater for $a$ compared to the respective value of all other
actions $a' \succ a$ (because $\succ$ serves as tie-breaker), and hence this
holds for some $\varepsilon$-environment of $t$ that is contained in the open
set $O_a$. For the actions $a' \prec a$ in this $\varepsilon$-environment of
$t$, the respective value also cannot be strictly greater compared to $a$,
because otherwise one of these actions would have been selected before.

Note that this argument provides optimality of the choices in $T_a$ as well as
openness of $T_a$. The optimality for the choices at the fringe of $T_a$ (and
hence the extension of the optimality argument to $D_a$) is a consequence of the
continuity of $f_{\max}$.

As every open and closed set in $\mathbb{R}$ is (Lebesgue) measurable, $O_a$
(or, to be precise, $O_a \cap (0, t_{\max})$) and $\overline{T_a}$, and hence
their intersection $D_a = \overline{T_a} \cap O_a$, are measurable.

Our construction therefore provides us with a measurable scheduler, which
is optimal, deterministic, and timed positional. □
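
The construction can be mimicked numerically: sweep backwards from the time bound, pick in every grid point an action with maximal gain (breaking ties by a fixed order), and merge neighbouring points with the same decision into intervals. The following Python sketch does this for a hypothetical CTMDP in which location `l` offers action `a` (rate 2 to a location `m`, from which the goal is reached at rate 2) and action `b` (rate 1 to the goal and rate 1 to a sink). The example, all names, the Euler discretisation, and the convention that earlier entries of `ORDER` win ties are assumptions of this illustration, not the paper's construction.

```python
import math

T_MAX = 1.0
ORDER = ["a", "b"]  # arbitrary fixed order on the actions; earlier entries win ties


def gains(f_l: float, f_m: float) -> dict:
    """Gain of each action in location l (goal value 1, sink value 0)."""
    return {
        "a": 2.0 * (f_m - f_l),                      # l --a--> m at rate 2
        "b": 1.0 * (1.0 - f_l) + 1.0 * (0.0 - f_l),  # l --b--> goal or sink, rate 1 each
    }


def decision_intervals(steps: int = 20000):
    """Sweep backwards from T_MAX and merge equal decisions into intervals."""
    h = T_MAX / steps
    f_l, f_m = 0.0, 0.0  # approximated values of l and m at time T_MAX
    intervals = []       # entries [start, end, action], collected from T_MAX downwards
    for k in range(steps):
        t = T_MAX - k * h
        g = gains(f_l, f_m)
        best = max(ORDER, key=lambda act: g[act])  # first maximal action in ORDER
        if intervals and intervals[-1][2] == best:
            intervals[-1][0] = t - h               # extend the current interval to the left
        else:
            intervals.append([t - h, t, best])
        f_l += h * g[best]                         # explicit Euler step for l
        f_m += h * 2.0 * (1.0 - f_m)               # m --c--> goal at rate 2
    return intervals[::-1]                         # ordered by increasing time


schedule = decision_intervals()
```

For this example the sweep yields two intervals: `a` on an initial interval and `b` on the final one, with the switch close to `T_MAX - math.log(2) / 2`. A grid can of course only approximate the sets $D_a$.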

By simply replacing maximisation by minimisation, sup by inf, and max by 
min, we can rewrite the proof to yield a similar theorem for the minimisation 
of time-bounded reachability, or likewise, for the maximisation of time-bounded 
safety. 

Theorem 2. For a CTMDP, there is a measurable TPD scheduler $S$ optimal for
minimum time-bounded reachability in the class of measurable THR schedulers.

4 Finite Optimal Control 

In this section we show that, once the existence of an optimal scheduler is es- 
tablished, we can refine this result to the existence of a cylindrical optimal TPD 
scheduler, that is, a scheduler that changes only finitely many times between 
different positional strategies. This is as close as we can hope to get to imple- 
mentability as optimal points for policy switching are — like in the example from 
Figure 1 — almost inevitably irrational. 

Our proof of Theorem 1 makes a purely topological existence claim, and
therefore does not imply that a finite number of switching points suffices. In
principle, this could mean that the required switching points have one or more
limit points, and an unbounded number of switches is required to optimise
time-bounded reachability. $x \cdot \sin(x^{-1})$ (cf. Figure 2) is an example
for a continuous function for which the preimage of 0 has a limit point at 0,
and the right curve of Figure 2 shows the derivatives for a positional scheduler
(black) and a potential comparison with a gain function such that their
intersections have a limit point.

Fig. 2. The plot on the left shows the function $x \cdot \sin(x^{-1})$ as an
example for the limit point problem. The plot on the right: Theorem 1 would not
exclude that the intersection points of derivatives (the 'loss') of two
positional schedulers have a limit point.

To exclude such limit points, and hence to prove the existence of an optimal
scheduler with a finite number of switching points, we re-visit the differential
equations that define the reachability probability, but this time to answer a
different question: Can we use the true values in some point of time to locally
find an optimal strategy for an $\varepsilon$-environment? If yes, then we could
exploit the compactness of $[0, t_{\max}]$: We could, for all points in time
$t \in [0, t_{\max}]$, fix a decision that is optimal in an
$\varepsilon$-environment of $t$. This would provide an open set with a
positional optimal strategy around each $t \in [0, t_{\max}]$, and hence an open
cover of a compact set, which would imply a finite cover with segments of
positional optimal strategies.

While this is the case for most points, this is not necessarily the case at our
switching points. In the remainder of this section, we therefore show something
similar: For every point $t \in [0, t_{\max}]$ in time, there is a positional
strategy that is optimal in a left $\varepsilon$-environment of $t$ (that is, in
a set $(t - \varepsilon, t) \cap [0, t_{\max}]$), and one that is optimal in a
right $\varepsilon$-environment of $t$. Hence, we get an open cover of
strategies with at most one switching point, and thus obtain a strategy with a
finite number of switching points.

Theorem 3. For every CTMDP, there is a cylindrical TPD scheduler $S$ optimal
for maximum time-bounded reachability in the class of measurable THR
schedulers.

Proof. We have seen that the true optimal reachability probability is defined by
a system of differential equations. In this proof we consider the effect of
starting with the 'correct' values for a time $t \in [0, t_{\max}]$, but locally
fix a positional strategy for a small left or right $\varepsilon$-environment of
$t$. That is, we consider only schedulers that keep their decision constant for
a (sufficiently) small time $\varepsilon$ before or after $t$.
Fig. 3. The gain functions of the competing stationary strategies of Figure 1.
To the left: gains developed from $t = t_{\max} = 1$ in order to find the only
switching point $t' = t_{\max} - [\ldots] \log(2)$. To the right: values (not
gains) developed from $t'$ in both directions, to show that action a is better
for all $t < t'$ whereas b is better for all $t > t'$. This construction is also
used to determine the existence of $\varepsilon$-environments with a stable
strategy around every point $t$.



Given a CTMDP $\mathcal{M}$, we consider the differential equations that
describe the development near the support point $f_{\max}(l, t)$ for each
location $l$ under a positional strategy $D$:

\[ -\dot{Pr}_l^D(\tau) = \sum_{l' \in L} R(l, a_l, l') \cdot \bigl( Pr_{l'}^D(\tau) - Pr_l^D(\tau) \bigr), \]

where $a_l$ is the action chosen at $l$ by $D$ (see Figure 3 for an example).

Different to the development of the true probability, the development of these 
linear differential equations provides us with smooth functions. This provides us 
with more powerful techniques when comparing two locally positional strategies: 
Each deterministic scheduler defines a system y = Ay of ordinary homogeneous 
linear differential equations with constant coefficients. 

As a result, the solutions Prf (r) of these differential equations — and 
hence their differences Prf (r) — Pr^ {t) — can be written as finite sums 
^"^^ Pi(r)e^''^, where Pi is a polynomial and the \i may be complex. Con- 
sequently, these functions are holomorphic. 

Using the identity theorem for holomorphic functions, $t$ can only be a limit
point of the set of zeros of $Pr_l^D(\tau) - Pr_l^{D'}(\tau)$ if $Pr_l^D(\tau)$ and $Pr_l^{D'}(\tau)$ are
identical on an $\varepsilon$-environment of $t$. The same applies to their derivatives:
either the zeros of $\dot{Pr}_l^D(\tau) - \dot{Pr}_l^{D'}(\tau)$ have no limit point in $t$, or
$\dot{Pr}_l^D(\tau)$ and $\dot{Pr}_l^{D'}(\tau)$ are identical on an $\varepsilon$-environment of $t$.

For the remainder of the proof, we fix, for a given time $t$, a sufficiently
small $\varepsilon > 0$ such that, for each pair of schedulers $D$ and $D'$ and every location
$l \in L$, $Pr_l^D(\tau) - Pr_l^{D'}(\tau)$ is either $< 0$, $= 0$, or $> 0$ on the complete interval
$L^t_\varepsilon = [t - \varepsilon, t) \cap [0, t_{max}] \ni \tau$, and, possibly with different sign, on the complete
interval $R^t_\varepsilon = (t, t + \varepsilon) \cap [0, t_{max}] \ni \tau$.

We argue the case for the left $\varepsilon$-environment $L^t_\varepsilon$. In the '>' case for a location
$l$, we say that $D$ is $l$-better than $D'$. We call $D$ preferable over $D'$ if $D'$ is not
$l$-better than $D$ for any location $l$, and better than $D'$ if $D$ is preferable over $D'$
and $l$-better for some $l \in L$.

If $D'$ is $l$-better than $D$ in exactly a non-empty set $L_b \subseteq L$ of locations, then
we can obviously use $D'$ to construct a strategy $D''$ that is better than $D$ by
switching to the strategies of $D'$ in exactly the locations $L_b$.

Since we choose our strategies from a finite domain (the deterministic
positional schedulers), this can happen only finitely many times. Hence we can
stepwise strictly improve a strategy, until we have constructed a strategy $D_{max}$
preferable over all others.

By the definition of being preferable over all other strategies, $D_{max}$ satisfies

$$-\dot{Pr}_l^{D_{max}}(\tau) = \max_{a \in Act(l)} \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot \bigl(Pr_{l'}^{D_{max}}(\tau) - Pr_l^{D_{max}}(\tau)\bigr)$$

for all $\tau \in L^t_\varepsilon$ and all $l \in L$.

We can use the same method for the right $\varepsilon$-environment $R^t_\varepsilon$, and pick the
decision for $t$ arbitrarily; we use the decision from the respective left
$\varepsilon$-environment.

Now we have fixed, for an $\varepsilon$-environment of an arbitrary $t \in [0, t_{max}]$, an
optimal scheduler with at most one switching point. As this is possible for all
points in $[0, t_{max}]$, the sets $I^t_\varepsilon = L^t_\varepsilon \cup R^t_\varepsilon$ define an open cover of $[0, t_{max}]$. Using
the compactness of $[0, t_{max}]$, we infer a finite sub-cover, which establishes the
existence of a strategy with a finite number of switching points. □
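
To make the max equations and the resulting switching points concrete, the following minimal sketch integrates them numerically backwards from $t_{max}$. Everything in it is an assumption made for illustration: the three-location CTMDP is a hypothetical toy model (it is not the CTMDP of Figure 1), the names `RATES`, `gain` and `greedy_backward` are invented here, and explicit Euler integration merely stands in for solving the differential equations exactly.

```python
# Hypothetical toy CTMDP (not the one from Figure 1):
# RATES[location][action] = {successor: rate}.
RATES = {
    "l": {"a": {"goal": 1.0}, "b": {"m": 3.0}},
    "m": {"c": {"goal": 3.0}},
    "goal": {"stay": {}},
}
GOAL = {"goal"}


def gain(f, loc, act):
    """Right-hand side sum_{l'} R(l, a, l') * (f(l') - f(l)) for action act."""
    return sum(rate * (f[succ] - f[loc]) for succ, rate in RATES[loc][act].items())


def greedy_backward(t_max, steps):
    """Explicit Euler integration of the max equations from t_max down to 0.

    Returns the approximate values at time 0 and the list of times at which
    the maximising action in location "l" changes (seen backwards in time).
    """
    h = t_max / steps
    f = {loc: (1.0 if loc in GOAL else 0.0) for loc in RATES}  # values at t_max
    switches, previous = [], None
    for k in range(steps):
        t = t_max - k * h
        # In every location, pick an action with maximal gain at time t.
        choice = {loc: max(sorted(RATES[loc]), key=lambda a: gain(f, loc, a))
                  for loc in RATES}
        if previous is not None and choice["l"] != previous:
            switches.append((t, previous, choice["l"]))
        previous = choice["l"]
        # One Euler step backwards in time: f(t - h) = f(t) + h * gain.
        f = {loc: f[loc] + h * gain(f, loc, choice[loc]) for loc in RATES}
    return f, switches


values, switches = greedy_backward(t_max=1.0, steps=20000)
print(values["l"], switches)
```

In this toy model the direct action `a` has the larger gain close to $t_{max}$, while the two-step route via `b` pays off when more time remains, so the sketch reports a single switch in location `l`. A hand calculation for this particular model places it at $t_{max} - \frac{1}{2}\ln\frac{3}{2}$.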

The proof for the minimisation of time-bounded reachability (or maximisa- 
tion of time-bounded safety) runs accordingly. 

Theorem 4. For every CTMDP, there is a cylindrical TPD scheduler $\mathcal{S}$ optimal
for minimal time-bounded reachability in the class of measurable THR schedulers.



5 Discrete Locations 

In this section, we treat the mildly more general case of single player CTMGs,
which are traditional CTMDPs plus discrete locations. We reduce the problem
of finding optimal measurable schedulers for CTMGs first to simple CTMGs,
that is, CTMGs whose discrete locations have no incoming transitions from continuous
locations. (They hence can only occur initially at time 0.) The extension from
CTMDPs to simple CTMGs is trivial.

Lemma 2. For a simple single player CTMG with only a reachability (or only
a safety) player, there is an optimal deterministic scheduler with finitely many
switching points.

Proof. By the definition of simple single player games, the likelihood of reaching
the goal location from any continuous location and any point in time is independent
of the discrete locations and their transitions. For continuous locations, we
can therefore simply reuse the results from Theorems 3 and 4.

We can only be in discrete locations at time 0, and for every continuous
location $l$ there is a fixed time-bounded reachability probability described by
$f_{opt}(l, 0)$. We can show that there is a timed-positional (even a positional)
deterministic optimal choice for the discrete locations at time $t = 0$ by induction
over the maximal distance to continuous locations: If all successors have been
evaluated, we can fix an optimal timed-positional choice. We can therefore use
discrete positions with maximal distance 1 as induction basis, and then apply
an induction step from positions with distance $\le n$ to positions with distance
$n + 1$. □

Rebuilding a single player CTMG $\mathcal{G}$ to a simple single player CTMG $\mathcal{G}_s$ can
be done in a straightforward manner; it suffices to pool all transitions taken
between two continuous locations. To construct the resulting simple CTMG $\mathcal{G}_s$,
we add new continuous locations for each possible time-abstract path from
continuous locations of the CTMG $\mathcal{G}$, and we add the respective actions: For
continuous locations $l_c, l'_c \in L_c$ and discrete locations $l_d^1, \ldots, l_d^n \in L_d$, a timed path
$l_c \to l_d^1 \to \cdots \to l_d^n \to l'_c$ translates to $l_c \to \underline{l_c \to l_d^1 \to \cdots \to l_d^n \to l'_c}$,
where the underlined part is a new continuous location. (For simplicity, we also
translate a timed path $l_c \to l'_c$ to $l_c \to \underline{l_c \to l'_c}$.)

The new actions of the resulting simple single player CTMG encode the
sequences of actions of $\mathcal{G}$ that a scheduler could make in the current location
plus in all possible sequences of discrete locations, until the next continuous
location is reached. (Note that this set is finite, and that the scheduler makes all
of these transitions at the same point of time.) If $a$ encodes choices that depend
only on the position (but not on this local history), $a$ is called positional. For
continuous locations, all old actions are deleted, and all new continuous locations
that end in a location $l_c \in L_c$ get the same outgoing transitions as $l_c$. The rate
matrix is chosen accordingly.

Adding the information about the path to locations allows us to reconstruct
the timed history in the single player CTMG from a history in the constructed
simple CTMG.

Theorem 5. For a single player CTMG $\mathcal{G}$ with only a reachability (or only
a safety) player, there is an optimal deterministic scheduler with finitely many
switching points.

Proof. First, every scheduler $\mathcal{S}$ for $\mathcal{G}$ can be naturally translated into a scheduler
$\mathcal{S}_s$ of $\mathcal{G}_s$, because every timed path in $\mathcal{G}_s$ defines a timed path in $\mathcal{G}$; the
resulting time-bounded reachability probability coincides.

Let us consider a cylindrical optimal deterministic scheduler $\mathcal{S}_{opt}$ for the
simple Markov game, and the function $f_{opt}$ defined by it. For the actions $a$ that $\mathcal{S}_{opt}$
chooses, we can, for each interval in which $\mathcal{S}_{opt}$ is positional, use an inductive
argument similar to the one from the proof of Lemma 2 to show that we can
choose a positional action $a'$ instead. The resulting cylindrical deterministic
scheduler $\mathcal{S}'_{opt}$ defines the same $f_{opt}$ (same differential equations).

Clearly, $f_{opt}(l_c, t) = f_{opt}(\cdots \to l_c, t)$ holds true. We use this observation to
change $\mathcal{S}'_{opt}$ to $\mathcal{S}''_{opt}$ by choosing the action that $\mathcal{S}'_{opt}$ chooses for $l_c$ for all locations
$\cdots \to l_c$ and at each point of time. The resulting scheduler $\mathcal{S}''_{opt}$ is still cylindrical
and deterministic, and defines the same $f_{opt}$ (same differential equations).

$\mathcal{S}''_{opt}$ is also the mapping of a cylindrical optimal deterministic scheduler for
$\mathcal{G}$. □



6 Continuous-Time Markov Games 

In this section, we lift our results from single player to general continuous-time 
Markov games. In general continuous-time Markov games, we are faced with two 
players with opposing objectives: A reachability player trying to maximise the 
time-bounded reachability probability, and a safety player trying to minimise 
it — we consider a 0-sum game. 

Thus, all we need to do for lifting our results to games is to show that the 
quest for optimal strategies for single player games discussed in the previous 
section can be generalised to a quest for co-optimal strategies — that is, for Nash 
equilibria — in general games. To demonstrate this, it essentially suffices to show 
that it is not important whether we first fix the strategy for the reachability 
player and then the one for the safety player in a strategy refinement loop, or 
vice versa. 

Let us first assume CTMGs without discrete locations. 

Lemma 3. Using the $\varepsilon$-environments $I^t_\varepsilon$ from the proof of Theorem 3, we can
construct a Nash equilibrium that provides co-optimal deterministic strategies
for both players, such that the co-optimal strategies contain at most one strategy
switch on $I^t_\varepsilon$.

Proof. We describe the technique to find a constant co-optimal strategy on the
right $\varepsilon$-environment $R^t_\varepsilon = (t, t + \varepsilon) \cap [0, t_{max}]$ of $t$.

We write a constant strategy as $D = S + R$ that is composed of the actions
chosen by the safety player on $L_s$, and the actions chosen by the reachability
player on $L_r$. For this simple structure, we introduce a strategy improvement
technique on the finite domain of deterministic choices for the respective player.

For a fixed strategy $S$ of the safety player, we can find an optimal counter
strategy $R(S)$ of the reachability player by applying the technique described in
Theorem 3. (For equivalent strategies, we make an arbitrary but fixed choice.)

We call the resulting vector $\bigl(-\dot{Pr}_l^{S+R(S)}(\tau)\bigr)_{l \in L}$ the quality vector
of $S$. Now, we choose an arbitrary $S$ for which this vector is minimal. (Note that
there could, potentially, be multiple incomparable minimal elements.)

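We now show that the following holds for $S$ and all $\tau \in R^t_\varepsilon$:

$$-\dot{Pr}_l^{S+R(S)}(\tau) = \max_{a \in Act(l)} \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot \bigl(Pr_{l'}^{S+R(S)}(\tau) - Pr_l^{S+R(S)}(\tau)\bigr)$$

for all $l \in L_r$, and

$$-\dot{Pr}_l^{S+R(S)}(\tau) = \min_{a \in Act(l)} \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot \bigl(Pr_{l'}^{S+R(S)}(\tau) - Pr_l^{S+R(S)}(\tau)\bigr)$$

for all $l \in L_s$. (Note that the order between the derivatives is maintained on the
complete right $\varepsilon$-environment $R^t_\varepsilon$.)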

The first of these claims is a trivial consequence of the proof of Theorem 3.
(The result is, for example, the same if we had a single player CTMDP that,
in the locations $L_s$ of the safety player, has only one possible action: the one
chosen by $S$.)

Let us assume that the second claim does not hold. Then we choose a
particular $l \in L_s$ where it is violated. Let us consider a slightly changed setting,
in which the choices in $l$ are restricted to two actions, the action $a_1$ chosen by
$S$, and the minimising action $a_2$. Among these two, one maximises, and one
minimises

$$\sum_{l' \in L} \mathbf{R}(l, a, l') \cdot \bigl(Pr_{l'}^{S+R(S)}(\tau) - Pr_l^{S+R(S)}(\tau)\bigr).$$

Let us fix all other choices of $S$, and allow the reachability player to choose
between $a_1$ and $a_2$ (we 'pass control' to the other player). As shown in Theorem 3,
she will select an action that produces the well defined set of max equations for
the resulting single player game. Hence, choosing $a_1$ and keeping all other choices
from $R(S)$ is the optimal choice for the reachability player in this setting (as the
max equations are satisfied, while they are violated for $a_2$).

Consequently, the quality vector for $S$ is strictly greater than the one for
the adjusted strategy. That is, assuming that choosing an arbitrary minimal
element does not lead to a satisfaction of the min and max equations leads to a
contradiction.

We can argue symmetrically for the left $\varepsilon$-environment. Note that the
satisfaction of the min and max equations implies that it does not matter if we
exchange the roles of the safety and reachability player in our argumentation. □

This lemma can easily be extended to construct simple co-optimal strategies: 

Theorem 6. For CTMGs without discrete locations, there are cylindrical
deterministic timed-positional co-optimal strategies for the reachability and the safety
player.

Proof. First, Lemma 3 provides us with an open cover of $[0, t_{max}]$ by environments
with co-optimal strategies that switch at most once, and we can build a strategy
that switches at most finitely many times from a finite sub-cover of the compact
interval $[0, t_{max}]$. This strategy is everywhere locally co-optimal, and forms a
Nash equilibrium:

It is straightforward to cut the interval $[0, t_{max}]$ into a finite set of
sub-intervals $[0, t_0], (t_0, t_1], \ldots, (t_{n-1}, t_n]$ with $t_n = t_{max}$, such that the strategy for
the safety player is constant in all of these intervals. We can use the construction
from Theorem 3 (note that the proof of Theorem 3 does not use that the
differential equations are initialised to 0 or 1 at $t_{max}$) to construct an optimal
strategy for the reachability player: We can first solve the problem for the
interval $[t_{n-1}, t_n]$, then for the interval $[t_{n-2}, t_{n-1}]$ using $f_{opt}(l, t_{n-1})$ as initialisation,
and so forth. A similar argument can be made for the other player.
This provides us with the same differential equations, namely:

$$-\dot{f}_{opt}(l, t) = \max_{a \in Act(l)} \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot \bigl(f_{opt}(l', t) - f_{opt}(l, t)\bigr)$$

for $t \in [0, t_{max}]$ and $l \in L_r$, and

$$-\dot{f}_{opt}(l, t) = \min_{a \in Act(l)} \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot \bigl(f_{opt}(l', t) - f_{opt}(l, t)\bigr)$$

for $t \in [0, t_{max}]$ and $l \in L_s$.

Note that all Nash equilibria need to satisfy these equations (with the
exception of 0-sets, of course), because otherwise one of the players could improve
her strategy. □

The extension of these results to the full class of CTMGs is straightforward:
We would first reprove Theorem 5 in the style of the proof of Theorem 3 (which
requires establishing the theorem in the first place). The only extension is that
we additionally get an equation $Pr_l^D(\tau) = \sum_{l' \in L} \mathbf{P}(l, a_l, l') \cdot Pr_{l'}^D(\tau)$ for every
discrete location $l$. The details are moved to Appendix C.

Theorem 7. For continuous-time Markov Games, there are cylindrical deter- 
ministic timed-positional co-optimal strategies for the reachability and the safety 
player. 

As a small side result, these differential equations show us that we can, for
each continuous location $l_c \in L_c$ and every action $a \in Act(l_c)$, add arbitrary
values to $\mathbf{R}(l_c, a, l_c)$ without changing the bounded reachability probability for
every pair of schedulers. (Only if we change $\mathbf{R}(l_c, a, l_c)$ to 0 we have to make
sure that $a$ is not removed from $Act(l_c)$.) In particular, this implies that we can
locally and globally uniformise a continuous-time Markov game if this eases its
computational analysis. (Cf. [10] for the simpler case of CTMDPs.)
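
The side result can be read off the right-hand side of the differential equations. The short derivation below only restates those equations; $\mathbf{R}'$ and the constant $c$ are names introduced here for illustration, where $\mathbf{R}'$ agrees with $\mathbf{R}$ except that $\mathbf{R}'(l_c, a, l_c) = \mathbf{R}(l_c, a, l_c) + c$.

```latex
% The self-loop term is multiplied by f_opt(l_c,t) - f_opt(l_c,t) = 0.
\begin{align*}
\sum_{l' \in L} \mathbf{R}'(l_c, a, l') \bigl(f_{opt}(l', t) - f_{opt}(l_c, t)\bigr)
  &= \sum_{l' \in L} \mathbf{R}(l_c, a, l') \bigl(f_{opt}(l', t) - f_{opt}(l_c, t)\bigr)
     + c \cdot \underbrace{\bigl(f_{opt}(l_c, t) - f_{opt}(l_c, t)\bigr)}_{=\,0} \\
  &= \sum_{l' \in L} \mathbf{R}(l_c, a, l') \bigl(f_{opt}(l', t) - f_{opt}(l_c, t)\bigr).
\end{align*}
```

Since the right-hand sides coincide for every action, the min and max equations and their solutions are unchanged.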

7 Variances 

In this section, we discuss the impact of small changes in the setting, namely 
the impact of infinitely many states or actions, and the impact of introducing a 
non-absorbing goal region. 

Infinitely Many States. If we allow for infinitely many states, optimal solutions
may require infinitely many switching points. To see this, it suffices to use one
copy of the CTMDP from Figure 1 for every natural number $i$, but with rates $i$ and $2i$ for the $i$-th copy,
and assign an initial probability distribution that assigns a weight of $2^{-i}$ to the

Fig. 4. An example CTMDP with infinitely many actions.



initial state At of the i-th copy. (If one prefers to consider only systems with 

bounded rates, one can choose rates 1 + 7 and 2 + | .) The switching points are 
then different for every copy, and an optimal strategy has to select the correct 
switching point for every copy. 

Infinitely Many Actions. If we allow for infinitely many actions, there is not even
an optimal strategy if we restrict our focus to CTMDPs with two locations, an
initial location and an absorbing goal location. For the CTMDP of Figure 4 with
the natural numbers $\mathbb{N}$ as actions and rate $\lambda_i = 2 - \frac{1}{i}$ for the action $i \in \mathbb{N}$ if we
have a reachability player, and $\lambda_i = \frac{1}{i}$ if we have a safety player, every strategy
$\mathcal{S}$ can be improved by a strategy $\mathcal{S}'$ that always chooses the successor $i + 1$
of the action $i$ chosen by $\mathcal{S}$.
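
The absence of an optimal strategy can be illustrated numerically. The sketch below assumes the reachability variant with rates $2 - \frac{1}{i}$ and restricts attention to strategies that pick one action $i$ for the whole time; the function name `reach_probability` and the time bound `t = 1.0` are choices made here for illustration.

```python
import math


def reach_probability(i, t):
    """Probability of reaching the goal within time t when always using action i.

    Assumes the rate 2 - 1/i for action i (reachability player variant).
    """
    rate = 2.0 - 1.0 / i
    return 1.0 - math.exp(-rate * t)


t = 1.0
values = [reach_probability(i, t) for i in range(1, 8)]
supremum = 1.0 - math.exp(-2.0 * t)  # approached for growing i, never attained

# Every action is strictly beaten by its successor ...
strictly_increasing = all(x < y for x, y in zip(values, values[1:]))
# ... and no action attains the supremum.
below_supremum = all(v < supremum for v in values)
print(strictly_increasing, below_supremum)
```

The same pattern appears for the safety player with rates $\frac{1}{i}$, where every action is beaten by its successor because a smaller rate is preferable.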

Reachability at $t_{max}$. If we drop the assumption that the goal region is absorbing,
one might be interested in the marginally more general problem to be (not to
be) in the goal region at time $t_{max}$ for the reachability player (safety player,
respectively). For this generalisation, no substantial changes need to be made: It
suffices to replace

$f_{opt}(l, t) = Pr_{\mathcal{S}}(l, t) = 1$ for all goal locations $l \in G$ and all $t \le t_{max}$

by

$f_{opt}(l, t_{max}) = Pr_{\mathcal{S}}(l, t_{max}) = 1$ for all goal locations $l \in G$.

(In order to be flexible with respect to this condition, the $-\dot{f}_{opt}(l, t)$ are
defined for goal locations as well. Note that, when all goal locations are absorbing,
the value of $-\dot{f}_{opt}(l, t)$ is 0 and $f_{opt}(l, t)$ is 1 for all goal locations $l \in G$ and all
$t \in [0, t_{max}]$.)

Appendix

As is to be expected from the topic, the paper is based in large parts on measure
theory. While the techniques are standard and straightforward for experts
in the field, we provide a short introduction to the ideas exploited. While we do
not use any technique beyond the standard curriculum of a mathematics degree, we
assume that some readers will appreciate a recap of the ideas behind the completion
of metric spaces and its application in our measures, which we give in Section A.
However, Section A is but a short introduction to the ideas, and cannot serve as a
self-contained introduction to the techniques.

Section B contains a short recap of the differential equations that describe
$f_{min}$ and $f_{max}$ and, more generally, the development of $Pr_{\mathcal{S}}(l, t)$ in
Subsection B.1, and a proof that $f_{max}$ truly establishes an upper bound on the
performance of any measurable scheduler in Subsection B.2, which constitutes a proof
of Lemma 1.

A Completion of Metric Spaces 

A metric space is called complete if every Cauchy sequence in it converges. A
Cauchy sequence in a metric space $(M, d)$ is a sequence $s : \mathbb{N} \to M$ such that
the following holds:

$$\forall \varepsilon > 0\ \exists n \in \mathbb{N}\ \forall l, m > n.\ d(s(l), s(m)) < \varepsilon$$

Intuitively one could say that a Cauchy sequence converges, but not
necessarily to a point within the space. For example, a sequence of rational
numbers that converges to $\sqrt{2}$ is a converging sequence in the real numbers,
but not in the rationals (as the limit point is outside of the carrier set), but it
is still a Cauchy sequence.
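
As an illustration of such a sequence, the following sketch builds rational approximations of $\sqrt{2}$ with exact arithmetic. The choice of Newton's iteration and the name `newton_sqrt2` are assumptions made here; any rational sequence converging to $\sqrt{2}$ would serve the same purpose.

```python
from fractions import Fraction


def newton_sqrt2(steps):
    """Rational Newton iterates x -> (x + 2/x) / 2 for x^2 = 2, starting at 1."""
    x = Fraction(1)
    sequence = [x]
    for _ in range(steps):
        x = (x + 2 / x) / 2
        sequence.append(x)
    return sequence


sequence = newton_sqrt2(6)
# Distances between consecutive elements shrink rapidly (Cauchy behaviour) ...
gaps = [abs(b - a) for a, b in zip(sequence, sequence[1:])]
# ... while x^2 - 2 gets arbitrarily small but never becomes 0 in the rationals.
residuals = [abs(x * x - 2) for x in sequence]
print([str(x) for x in sequence[:4]])
print([float(g) for g in gaps])
```

All elements are rationals and the sequence is Cauchy, yet no element (and no rational limit) squares to 2, which is exactly the gap that the completion closes.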

The basic technique to complete an incomplete metric space $(M, d)$ is to
use the Cauchy sequences of this space as the new carrier set $M'$, and define
a distance function $d'$ between two Cauchy sequences $s, s' \in M'$ of $M$ to be
$d'(s, s') = \lim_{n \to \infty} d(s(n), s'(n))$. Now, $(M', d')$ is not yet a metric space, because
two different Cauchy sequences (for example the constant sequence $0$ and the
sequence $s(n) = \frac{1}{n}$ of the rationals) can have distance 0.

Technically, one therefore defines equivalence classes of Cauchy sequences
that have distance 0 with respect to $d'$ as the new carrier set $M''$ of a metric space
$(M'', d'')$, where the distance function $d''$ is defined by using $d'$ on representatives
of the quotient classes of Cauchy sequences. This also complies with the intuition:
a Cauchy sequence in $M$ is meant to represent its limit point (which is not
necessarily in $M$), and hence two Cauchy sequences with the same limit point
should be identified.

The resulting metric space $(M'', d'')$ is complete by construction. The
simplest example of such a completion is the completion of the rational numbers
into the real numbers. On this level, a straightforward effect of completion
can be easily explained: To end up with $(M'', d'')$ (or a space isomorphic to it),
we can start with any dense subset $S$ of $M''$.

A subset $S \subseteq M''$ is dense in $M''$ if, for every point $m \in M''$ and every
$\varepsilon > 0$, there is a point $s \in S$ with $d''(m, s) < \varepsilon$. Looking at the definition, one
immediately sees the connection to Cauchy sequences: One could intuitively say
that $S \subseteq M''$ is dense in $M''$ if, for every point $m \in M''$, there is a Cauchy
sequence $s$ in $S$ with limit point $m$.

Hence, it does not matter which dense set we use as a starting point. Of
course, it works to use $M''$ itself: in the example of the real numbers, we can start
with the real numbers themselves (without gaining anything by applying the
completion twice), with the transcendental numbers, with the non-transcendental
numbers, or, more down to earth, with finite decimal fractions. Note that a
subset is dense in $M''$ if it is dense in some $S$ that is dense in $M''$.



A.1 Application in Measure Theory

Another famous application of this completion technique is the completion of
Riemann integrable functions to Lebesgue integrable functions. The distance
between two functions is the Riemann integral over the absolute value
of their difference. (Strictly speaking, this does again not form a metric space,
and we again have to use the quotient class of functions with distance 0.)

Riemann integrable functions are, for example, required to be bounded,
but there are other problems as well; for example, we cannot integrate
over the characteristic function of the rational numbers. (The Wikipedia article
on Riemann integrable functions gives a good overview of the
weaknesses.)

Using the completion technique defined above, one can, for example, integrate
over the characteristic function of the rationals by enumerating them, that is,
by defining a surjection $s : \mathbb{N} \to \mathbb{Q}$, and choosing $f_i$ to be the Riemann integrable
function that is 1 at the image $s(\{1, 2, \ldots, i\})$ of the initial sequence of length $i$
of the naturals, and 0 elsewhere. Clearly, the limit of the sequence $f_1, f_2, f_3, \ldots$ is the
characteristic function, the Riemann integral over each $f_i$ is 0 (no matter over which
interval we integrate), and the sequence is a Cauchy sequence.

The completion of the space of Riemann integrable functions (which
essentially establishes the Lebesgue integrable functions) is the space of all Cauchy
sequences of Riemann integrable functions (or again representatives of the quotient
classes of equivalent Cauchy sequences).

Again, it does not make a difference if we start the completion with the
Riemann integrable functions, or with weaker concepts, as long as they are dense
in the resulting space, or indeed in the Riemann integrable functions. A well
known example for such a class is the class of block functions, where the value
changes only in finitely many positions.

A.2 Application in Our Measure

The definition of the measures for continuous-time Markov chains, games (with
fixed strategies), and decision processes (with a fixed scheduler) works in exactly
this way: For Markov chains, one defines the probability measure for simple
disjoint (or almost disjoint) sets, for which the probability is simple to determine,
and such that one can define the probability of reaching the goal region in time
for cylindrical sets.

When considering games and decision processes without fixed scheduling
policies, however, we have two layers of completions: one layer for a given cylindrical
scheduler, and one on cylindrical schedulers.

Measure for a given cylindrical scheduler. In the case of games and
decision processes, one starts to define it for a particularly friendly and easy to
handle class of strategies or schedulers, respectively, such that the techniques
from Markov chains can be extended with minor adjustments. Our cylindrical
schedulers (which in our paper also represent pairs of strategies in games) with
only finitely many switching points are an example of such an extension.

The measure for a given cylindrical scheduler is a mild extension of the
techniques for Markov chains. The building blocks of the probability measure define
the probability on cylindrical sets, and they are a straightforward extension
of the probability measure for continuous-time Markov chains to continuous-time
Markov decision processes (and games) with a fixed scheduler of this type.
However, they only describe the likelihood for cylindrical sets, and without
completing the space we can but use finite sums over disjoint sets.

To obtain the time bounded reachability probability, we use Cauchy
sequences of such sets that converge against the set of all paths on which a goal
region is reached. A representative of this equivalence class would be a sequence
$P_1, P_2, P_3, \ldots$ of sets of paths, where $P_i$ contains the cylindrical sets of length
up to $i$ in which the goal region is reached. This is a Cauchy sequence (cf. the
argument in Subsection B.2), and the limit contains all finite paths upon which
the goal region is reached.

Measure on cylindrical schedulers. While we have established a
measure for a given cylindrical scheduler, the class of cylindrical schedulers is
not particularly strong, and, like with Riemann integrable functions, we need
to strengthen the class of schedulers we allow for. In order to exploit the
aforementioned completion technique, we have to create a suitable metric space on
them, and in order to introduce such a metric space, we need a measure for the
difference between strategies. Such a measure reflects the likelihood that two
different strategies ever lead to different actions. For example, for deterministic
schedulers $D$ and $E$ for a CTMDP $\mathcal{M}$ we define a difference scheduler $\mathcal{S}_{\{D,E\}}$
that uses the actions of $D$ and $E$ on every history on which they coincide, and a
fresh action $(a_D, a_E)$ if $D$ chose $a_D$ and $E$ chose $a_E \neq a_D$ upon this history.
(Note that the difference measure uses a slightly adjusted CTMDP $\mathcal{M}'$; it also
has a fresh goal location.)

The new action $(a_D, a_E)$ leads to a fresh continuous goal location $g$; the old
goal locations do not remain goal locations, the new goal region contains only $g$.
For a continuous location $l$, we fix

- $\mathbf{R}(l, (a_D, a_E), g) = \mathbf{R}(l, a_D, L) + \mathbf{R}(l, a_E, L)$ and
- $\mathbf{R}(l, (a_D, a_E), l') = 0$ for all locations $l' \neq g$

for the new actions, and maintain the entries in the rate matrix for the old
actions. (To be formally correct, we provide $g$ with only one enabled action $a$,
with $\mathbf{R}(g, a, l') = 0$ for all locations $l' \neq g$, that $\mathcal{S}_{\{D,E\}}$ selects in $g$ upon any
history.)

For discrete locations, we would fix $\mathbf{P}(l, (a_D, a_E), g) = 1$ (and
$\mathbf{P}(l, (a_D, a_E), l') = 0$ for $l' \neq g$) for all new actions, and maintain the entries
for the old actions in $\mathbf{P}$.

The distance between two schedulers is then defined as the likelihood that $g$
is reached in the adjusted CTMDP within the given time bound when using the
cylindrical scheduler $\mathcal{S}_{\{D,E\}}$.
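
For positional deterministic schedulers on a CTMDP with only continuous locations, this distance can be sketched in a few lines. The model `CTMDP`, the schedulers `D` and `E`, and all function names below are hypothetical and chosen for illustration only; the reachability probabilities are approximated by explicit Euler integration of the backward equations, which is an assumption of this sketch and not a method prescribed by the paper.

```python
# Hypothetical toy CTMDP: CTMDP[location][action] = {successor: rate}.
CTMDP = {
    "k": {"go": {"l": 2.0}},
    "l": {"a": {"goal": 1.0}, "b": {"m": 3.0}},
    "m": {"c": {"goal": 3.0}},
    "goal": {"stay": {}},
}


def induced_chain(ctmdp, s):
    """Markov chain obtained by fixing the positional scheduler s."""
    return {loc: dict(actions[s[loc]]) for loc, actions in ctmdp.items()}


def difference_chain(ctmdp, d, e):
    """Markov chain induced by the difference scheduler of positional D and E.

    Where D and E agree, their common action is kept; where they differ, a
    fresh action leads to the fresh goal location "g" with the sum of both
    exit rates R(l, a_D, L) + R(l, a_E, L).
    """
    chain = {"g": {}}
    for loc, actions in ctmdp.items():
        if d[loc] == e[loc]:
            chain[loc] = dict(actions[d[loc]])
        else:
            exit_d = sum(actions[d[loc]].values())
            exit_e = sum(actions[e[loc]].values())
            chain[loc] = {"g": exit_d + exit_e}
    return chain


def reach_probability(chain, goal, start, t_max, steps=20000):
    """Explicit Euler integration of the backward equations from t_max to 0."""
    h = t_max / steps
    f = {loc: (1.0 if loc == goal else 0.0) for loc in chain}
    for _ in range(steps):
        f = {loc: f[loc] + h * sum(r * (f[s] - f[loc]) for s, r in chain[loc].items())
             for loc in chain}
    return f[start]


D = {"k": "go", "l": "a", "m": "c", "goal": "stay"}
E = {"k": "go", "l": "b", "m": "c", "goal": "stay"}

distance = reach_probability(difference_chain(CTMDP, D, E), "g", "k", t_max=1.0)
pr_d = reach_probability(induced_chain(CTMDP, D), "goal", "k", t_max=1.0)
pr_e = reach_probability(induced_chain(CTMDP, E), "goal", "k", t_max=1.0)
print(distance, abs(pr_d - pr_e))
```

In this example the two schedulers differ only in location `l`, so the distance is the probability of reaching `l` and then leaving it within the time bound; the printed difference of the two reachability probabilities stays below it.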

The metric space on cylindrical schedulers. This distance function almost
defines a metric space on the cylindrical schedulers we used as a basic building
block; symmetry, triangle inequality and non-negativity obviously hold. However,
we again have to resort to a carrier set of quotient classes of schedulers
with distance 0, in order to satisfy $d(x, y) = 0 \Leftrightarrow x = y$. (One can think of different
actions on unreachable paths.)

Effect of the metric. The distance is defined in a way that guarantees that
the absolute value of the difference between the time bounded reachability
probabilities for $D$ and $E$ is bounded by the distance between these schedulers (or
representatives of their quotient class): To see this, it suffices to look at the
scheduler $\mathcal{S}_{\{D,E\}}$ in $\mathcal{M}'$ with two adjusted goal regions:

1. If we use the old goal region plus $g$ as our new goal region, then obviously
the time bounded reachability is at least as good as the time bounded reachability
of $D$ and of $E$ in $\mathcal{M}$.

2. If we use the old goal region (but not $g$, which rather becomes a
non-accepting sink) as a new goal region, then the time bounded reachability is
at most as good as the time bounded reachability of $D$ and of $E$ in $\mathcal{M}$.

The difference between the two cases, however, is exactly the distance
between $D$ and $E$, which therefore in particular is an upper bound on the difference
between their time bounded reachability probabilities.

Completion. The time bounded reachability of a Cauchy sequence can
therefore be defined as the limit of the time bounded reachability of the elements of
the sequence, which is guaranteed to exist (and to be unique) by the previous
argument.

Extension to randomised schedulers. If the schedulers $D$ and $E$ from above
are randomised, then we would define the distribution chosen by $\mathcal{S}_{\{D,E\}}$ in line
with the definition from above. We first fix an arbitrary global order on all
actions used as tie breaker in our construction.

- If, in a particular history that ends in some location $l$, $D$ chooses an action $a$
with probability $p_D$ and $E$ chooses $a$ with probability $p_E$, then $\mathcal{S}_{\{D,E\}}$ chooses
$a$ with probability $\min\{p_D, p_E\}$.

- If the probabilities assigned by this rule sum up to one, the distribution chosen
by $\mathcal{S}_{\{D,E\}}$ on this history is thus defined.

- Otherwise, if $l$ is continuous, we choose an action $a_D$ among the actions
with $p_D > p_E$ that maximises $\mathbf{R}(l, a, L)$, using the fixed global order as a tie
breaker. For discrete locations, we use the tie breaker only.

- We then accordingly choose an action $a_E$ among the actions with $p_E > p_D$
that maximises $\mathbf{R}(l, a, L)$, using the fixed global order as a tie breaker. For
discrete locations, we again use the tie breaker only.

- We assign the remaining probability weight to the decision $(a_D, a_E)$.

All arguments from above extend to this case.
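
The rules for randomised schedulers can be sketched for a single history as follows. The function name `difference_distribution`, its parameters, and the representation of distributions as dictionaries of exact fractions are assumptions made for this illustration.

```python
from fractions import Fraction


def difference_distribution(p_d, p_e, exit_rate, order, continuous=True):
    """Distribution chosen by the difference scheduler on one history.

    p_d, p_e:   dicts action -> probability chosen by D and E (each sums to 1)
    exit_rate:  dict action -> R(l, a, L); only used for continuous locations
    order:      list of all actions, the fixed global order (earlier wins ties)
    """
    # Rule 1: every action gets the minimum of both probabilities.
    dist = {}
    for a in order:
        common = min(p_d.get(a, 0), p_e.get(a, 0))
        if common > 0:
            dist[a] = common
    # Rule 2: if this already sums up to one, we are done.
    remaining = 1 - sum(dist.values())
    if remaining == 0:
        return dist

    def pick(p_more, p_less):
        # Rules 3 and 4: among the actions preferred by one scheduler, take
        # one with maximal exit rate; the global order breaks ties.
        candidates = [a for a in order if p_more.get(a, 0) > p_less.get(a, 0)]
        if continuous:
            best = max(exit_rate[a] for a in candidates)
            candidates = [a for a in candidates if exit_rate[a] == best]
        return candidates[0]

    # Rule 5: the remaining weight goes to the fresh decision (a_D, a_E).
    dist[(pick(p_d, p_e), pick(p_e, p_d))] = remaining
    return dist


half = Fraction(1, 2)
example = difference_distribution(
    {"a": half, "b": half}, {"a": half, "c": half},
    exit_rate={"a": 1, "b": 2, "c": 3}, order=["a", "b", "c"])
print(example)
```

Exact fractions are used so that the test for a remaining weight of zero is not affected by rounding; with floating point probabilities a tolerance would be needed.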

Measurable schedulers. The term measurable scheduler refers to the limit
scheduler of such a Cauchy sequence. To restrict the attention to this class of
schedulers is simply a requirement caused by the definition of the measure: Only
for such schedulers the time bounded reachability probability is defined.

However, we do not think this is a real drawback, as the set of measurable
schedulers far outreaches the power of everything one might have in mind when
talking about schedulers (like choosing $a$ on the points in time defined by the
Cantor set), and it is not easy to describe a non-measurable scheduler in the first
place.

B Optimal reachability probability 

B.1 Differential Equations

The differential equations defining $f_{opt}$ are simply the differential equations in
place when a strategy is locally constant. This holds almost everywhere
(everywhere but in a 0-set of positions) in case of the cylindrical schedulers that are
the basic building blocks in the incomplete space that we have completed by
considering Cauchy sequences of cylindrical schedulers.

Hence, for every cylindrical scheduler we can partition the interval $[0, t_{max}]$
into a finite set of intervals $I_0, I_1, \ldots, I_n$ as described in the preliminaries.

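Within such an interval,

$$-\dot{Pr}_{\mathcal{S}}(\pi, t) = \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot \bigl(Pr_{\mathcal{S}}(\pi \xrightarrow{a,t} l', t) - Pr_{\mathcal{S}}(\pi, t)\bigr) \quad \text{for } t \in I_i$$

holds for deterministic schedulers, where $\pi$ is a timed path that ends in $l$, $a$ is the
deterministic choice the scheduler makes in $I_i$ on this history, and $\pi \xrightarrow{a,t} l'$ is its
extension. For randomised schedulers,

$$-\dot{Pr}_{\mathcal{S}}(\pi, t) = \sum_{a \in Act(l)} h(a) \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot \bigl(Pr_{\mathcal{S}}(\pi \xrightarrow{a,t} l', t) - Pr_{\mathcal{S}}(\pi, t)\bigr) \quad \text{for } t \in I_i$$

holds, where $\pi$ is a timed path that ends in $l$, $h(a)$ is the likelihood that the
cylindrical scheduler makes the decision $a$ in $I_i$ on this history, and $\pi \xrightarrow{a,t} l'$ is
its extension.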

To initialise the potentially infinite set of differential equations, we have the
following initialisations:

- $Pr_{\mathcal{S}}(\pi, t) = 1$ holds for all timed histories $\pi$ that contain (and hence end up
in) locations $l \in G$ in the goal region and all $t \le t_{max}$,
- $Pr_{\mathcal{S}}(\pi, t_{max}) = 0$ holds for all timed histories $\pi$ that contain only non-goal
locations $l \notin G$, and
- $Pr_{\mathcal{S}}(\pi, t) = 0$ holds for all timed histories $\pi$ (ending in any location $l \in L$) and all $t > t_{max}$.

Additionally, we have to consider what happens at the intersection $t_i$ of the
fringes of $I_i$ and $I_{i+1}$ for $0 \le i < n$. But obviously, we can simply first solve the
differential equations for $I_n$, then use the resulting values at its lower fringe as
initialisations for the interval $I_{n-1}$, and so forth.

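Remark: For timed positional deterministic schedulers we get

$$-\dot{Pr}_{\mathcal{S}}(l, t) = \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot \bigl(Pr_{\mathcal{S}}(l', t) - Pr_{\mathcal{S}}(l, t)\bigr) \quad \text{for } t \in I_i,$$

and

$$-\dot{Pr}_{\mathcal{S}}(l, t) = \sum_{a \in Act(l)} h(a) \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot \bigl(Pr_{\mathcal{S}}(l', t) - Pr_{\mathcal{S}}(l, t)\bigr) \quad \text{for } t \in I_i$$

for timed positional randomised schedulers. In both cases, the initialisation reads

- $Pr_{\mathcal{S}}(l, t) = 1$ holds for all goal locations $l \in G$ and all $t \le t_{max}$,
- $Pr_{\mathcal{S}}(l, t_{max}) = 0$ holds for all non-goal locations $l \notin G$, and
- $Pr_{\mathcal{S}}(l, t) = 0$ holds for all locations $l \in L$ and all $t > t_{max}$.

Obviously, these differential equations can also be used in the limit.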
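
The interval-by-interval solution for timed positional deterministic schedulers can be sketched as follows. The toy model `RATES`, the function `evaluate`, the three example schedulers and the use of explicit Euler steps are all assumptions made for illustration; the switching time used in the example was worked out by hand for this particular toy model.

```python
import math

# Hypothetical toy CTMDP: RATES[location][action] = {successor: rate}.
RATES = {
    "l": {"a": {"goal": 1.0}, "b": {"m": 3.0}},
    "m": {"c": {"goal": 3.0}},
    "goal": {"stay": {}},
}
GOAL = {"goal"}


def evaluate(intervals, steps_per_unit=20000):
    """Values Pr_S(l, 0) of a cylindrical timed positional deterministic scheduler.

    intervals: list of (start, end, decision), ordered by time and covering
    [0, t_max], where decision[location] is the action used on that interval.
    """
    f = {loc: (1.0 if loc in GOAL else 0.0) for loc in RATES}  # values at t_max
    for start, end, decision in reversed(intervals):           # solve I_n first
        steps = max(1, round((end - start) * steps_per_unit))
        h = (end - start) / steps
        for _ in range(steps):
            # One Euler step backwards in time under the fixed decision.
            f = {loc: f[loc] + h * sum(rate * (f[succ] - f[loc])
                                       for succ, rate in RATES[loc][decision[loc]].items())
                 for loc in RATES}
    return f


use_a = {"l": "a", "m": "c", "goal": "stay"}
use_b = {"l": "b", "m": "c", "goal": "stay"}
t_max = 1.0
t_switch = t_max - 0.5 * math.log(1.5)  # hand-computed for this toy model

always_a = evaluate([(0.0, t_max, use_a)])
always_b = evaluate([(0.0, t_max, use_b)])
switching = evaluate([(0.0, t_switch, use_b), (t_switch, t_max, use_a)])
print(always_a["l"], always_b["l"], switching["l"])
```

The values at the lower fringe of one interval are simply carried over as initialisation for the interval before it, which is why a single dictionary `f` suffices.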

B.2 Timed positional schedulers suffice for optimal time-bounded 
reachability 

In this subsection we sketch a proof of a variant of Lemma 1; we demonstrate
the following claim for arbitrary $t_{max} > 0$:

Lemma 4. For a CTMDP $\mathcal{M}$ with only continuous locations,
$Pr_S(l,t) \le f_{max}(l,t)$ holds for every scheduler $S$, every location $l$,
and every $t \in [0,t_{max}]$.

In the proof, we assume a scheduler that provides a $3\varepsilon$ better result,
then sacrifice one $\varepsilon$ to transfer to cylindrical schedulers (going back
to the simpler incomplete space of cylindrical schedulers, but with completed
reachability measure), and then sacrifice a second $\varepsilon$ to discard long
histories from consideration (going back to the simple space of finite sums over
cylindrical sets).

As a result, we can do the comparison in a simple finite structure. 

Proof. Let us assume that the claim is incorrect. Then, there is a CTMDP
$\mathcal{M}$ with location $l_0$ and a scheduler $S_{3\varepsilon}$ for
$\mathcal{M}$ such that the time-bounded reachability probability is at least
$3\varepsilon$ higher for some $\varepsilon > 0$ and $t_0 \in [0,t_{max}]$.
That is, $Pr_{S_{3\varepsilon}}(l_0,t_0) - f_{max}(l_0,t_0) \ge 3\varepsilon$.

Let us fix appropriate $\mathcal{M}$, $\varepsilon$, and $S_{3\varepsilon}$.
(Note that $S_{3\varepsilon}$ does not have to be timed positional or
deterministic.)

Recall that $S_{3\varepsilon}$ is the limit point of a Cauchy sequence of
cylindrical schedulers. Hence, almost all of these cylindrical schedulers have
distance $< \varepsilon$ to $S_{3\varepsilon}$.

Let us fix such a cylindrical scheduler $S_{2\varepsilon}$ with distance
$< \varepsilon$ to $S_{3\varepsilon}$. The time-bounded reachability probability
of $S_{2\varepsilon}$ is still at least $2\varepsilon$ higher compared to
$f_{max}$. That is, $Pr_{S_{2\varepsilon}}(l_0,t_0) - f_{max}(l_0,t_0) \ge 2\varepsilon$
holds true.

For $S_{2\varepsilon}$, we now consider a tightened form of time-bounded
reachability, where we additionally require that the goal region is to be
reached within $n_\varepsilon$ steps. We choose $n_\varepsilon$ big enough that
the likelihood of seeing more than $n_\varepsilon$ discrete events is less than
$\varepsilon$. We call this time-bounded $n_\varepsilon$ reachability.
Remark: We can estimate $n_\varepsilon$ by taking the maximal transition rate
$\lambda_{max} = \max\{R(l,a,L) \mid l \in L_c, a \in Act\}$, and choose
$n_\varepsilon$ big enough that the likelihood of having more than
$n_\varepsilon$ transitions is smaller than $\varepsilon$ even if all
transitions had transition rate $\lambda_{max}$. As the number of steps is
Poisson distributed in this case, a suitable $n_\varepsilon$ is easy to find.
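
The remark above can be sketched in a few lines. The function name
`steps_bound` and its range guard are assumptions of this example; it relies
only on the fact that, with all rates equal to the maximal rate, the number of
transitions within the time bound is Poisson distributed with mean
`lambda_max * t_max`.

```python
import math


def steps_bound(lambda_max, t_max, eps):
    """Smallest n such that P[N > n] < eps for N ~ Poisson(lambda_max * t_max)."""
    mean = lambda_max * t_max
    if not (0.0 <= mean < 700.0 and 1e-12 < eps < 1.0):
        # plain floats: exp(-mean) underflows for large means, and tails
        # far below 1e-12 are lost in rounding
        raise ValueError("outside the range this plain float sketch handles")
    term = math.exp(-mean)      # P[N = 0]
    cdf = term                  # P[N <= n] for the current n
    n = 0
    while 1.0 - cdf >= eps:
        n += 1
        term *= mean / n        # P[N = n] obtained from P[N = n - 1]
        cdf += term
    return n


print(steps_bound(2.0, 1.0, 0.01))  # prints 6
```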

The adjustment to time-bounded $n_\varepsilon$ reachability leads to a small
change in the initialisation of the differential equations: For timed histories
$\pi$ of length $> n_\varepsilon$ that do not contain a location $l \in G$ in
the goal region within the first $n_\varepsilon$ steps, we use $f(\pi,t) = 0$
(even if it contains a goal region after more than $n_\varepsilon$ steps) for
all $t \in [0,t_{max}]$. As the probability measure of all timed histories $\pi$
of length $> n_\varepsilon$ is $< \varepsilon$, time-bounded $n_\varepsilon$
reachability for $S_{2\varepsilon}$ is still at least $\varepsilon$ higher than
the value for $f_{max}$.

Let us use $f(\pi,t)$ to express the time-bounded $n_\varepsilon$ reachability
for $S_{2\varepsilon}$ on a path $\pi$ at time $t$. Then this claim can be
phrased as $f(l_0,t_0) - f_{max}(l_0,t_0) \ge \varepsilon$.

We have now reached a finite structure, and can easily show that this leads
to a contradiction: We provide an inductive argument which even demonstrates
that $f_{max}(l,t) \ge f(\pi,t)$ holds for all $\pi$ that end in $l$ and all
$t \in [0,t_{max}]$.

As a basis for our induction, this obviously holds for all timed histories
longer than $n_\varepsilon$: in this case, $f_{max}(l,t) = 1$ or $f(\pi,t) = 0$
holds true (where the or is not exclusive).

For our induction step, let us assume we have demonstrated the claim for all
histories of length $> n$. Let us, for a timed history $\pi$ of length $n$ that
ends in $l$ and some point $t \in [0,t_{max}]$, assume that
$f_{max}(l,t) < f(\pi,t)$.

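For $l \in G$ the initialisation conditions immediately lead to the contradiction
$1 < f(\pi,t)$. For $l \notin G$, we can stepwise infer

\begin{align*}
-\dot{f}_{max}(l,t) &= \max \Bigl\{ \sum_{l' \in L} R(l,a,l') \cdot \bigl( f_{max}(l',t) - f_{max}(l,t) \bigr) \Bigm| a \text{ is enabled in } l \Bigr\} \\
 &\ge \sum_{a \in Act} h(a) \sum_{l' \in L} R(l,a,l') \cdot \bigl( f_{max}(l',t) - f_{max}(l,t) \bigr) \\
 &> \sum_{a \in Act} h(a) \sum_{l' \in L} R(l,a,l') \cdot \bigl( f_{max}(l',t) - f(\pi,t) \bigr) && \text{(with } f_{max}(l,t) < f(\pi,t)\text{)} \\
 &\ge \sum_{a \in Act} h(a) \sum_{l' \in L} R(l,a,l') \cdot \bigl( f(\pi \xrightarrow{a,t} l', t) - f(\pi,t) \bigr) && \text{(with I.H.)} \\
 &= -\dot{f}(\pi,t),
\end{align*}

where $h$ is the distribution over actions that $S_{2\varepsilon}$ chooses on
the history $\pi$ at time $t$.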

Taking into account that $f_{max}(l,t_{max})$ and $f(\pi,t_{max})$ are both
initialised to $0$ for $l \notin G$, we can, using the just demonstrated
$f_{max}(l,t) < f(\pi,t) \Rightarrow \dot{f}_{max}(l,t) < \dot{f}(\pi,t)$, infer
$f_{max}(l,t) \ge f(\pi,t)$ for all $t \in [0,t_{max}]$: This inequation holds
on the right fringe of the interval (initialisation), and when we follow the
curves of $f(\pi,t)$ and $f_{max}(l,t)$ to the left along $[0,t_{max}]$, then
every time $f$ would catch up with $f_{max}$, $f$ cannot fall steeper than
$f_{max}$ at such a position (where 'fall' takes the usual left-to-right view;
in the right-to-left direction we consider, one should maybe say 'cannot have a
steeper ascent'), and hence cannot get above $f_{max}$.

In particular, $f(l_0,t_0) \le f_{max}(l_0,t_0)$, which contradicts the initial
assumption.

The min case can be proven accordingly, which provides a full proof of 
Lemma 1. 

(Note that the extension to CTMDPs with both discrete and continuous 
locations is provided in Section 5.) 

C Reproof of Theorem 5 

To lift Theorem 6 to the full class of CTMGs, we reprove Theorem 5 in the style 
of the proof of Theorem 3. Recall that Theorem 3 establishes the existence of 
an optimal cylindrical scheduler using the existence of an optimal measurable 
scheduler, and the form of the (differential) equations defining the time-bounded 
reachability probability for it. The proof given in this appendix can therefore
not be used to supersede the proof in the paper.

First we observe from the proof of Theorem 5 that, for discrete locations
$l \in L_d$, the equations

\[ f_{opt}(l,t) = \operatorname*{opt}_{a \in Act(l)} \sum_{l' \in L} P(l,a,l') \cdot f_{opt}(l',t) \quad \text{for } t \in [0,t_{max}] \]

hold for $opt \in \{\min, \max\}$, and that they, together with the differential
equations for the continuous locations (the differential equations remain
unchanged), define $f_{opt}$.

The differences in the proof of Theorem 8 compared to the proof of Theorem 3
are marked in blue.

Theorem 8. For a single player continuous-time Markov game with only a 
reachability player, there is an optimal deterministic scheduler with finitely many 
switching points. 

Proof. We have seen that the true optimal reachability probability is defined by
a system of equations and differential equations. In this proof we consider the
effect of starting with the 'correct' values for a time $t \in [0,t_{max}]$, but
locally fix a positional strategy for a small left or right
$\varepsilon$-environment of $t$. That is, we consider only schedulers that keep
their decision constant for a (sufficiently) small time $\varepsilon$ before or
after $t$.

Given a CTMG $\mathcal{M}$, we consider the equations and differential equations
that describe the development of the reachability probability for each location
$l$ under a positional deterministic strategy $D$:

\[ -\dot{Pr}_l^D(\tau) = \sum_{l' \in L} R(l,a_l,l') \cdot \bigl( Pr_{l'}^D(\tau) - Pr_l^D(\tau) \bigr) \quad \text{for } l \in L_c, \]

\[ Pr_l^D(\tau) = \sum_{l' \in L} P(l,a_l,l') \cdot Pr_{l'}^D(\tau) \quad \text{for } l \in L_d, \]

where $a_l$ is the action chosen at $l$ by $D$, starting at the support point
$f_{max}(l,t)$.

In contrast to the development of the true probability, the development of these
linear differential equations provides us with smooth functions. This provides us
with more powerful techniques when comparing two locally positional strategies:
Each deterministic scheduler defines a system $\dot{y} = Ay$ of ordinary
homogeneous linear differential equations with constant coefficients.

As a result, the solutions $Pr_l^D(\tau)$ of these differential equations (and
hence their differences $Pr_l^D(\tau) - Pr_l^{D'}(\tau)$) can be written as
finite sums $\sum_{i=1}^{n} P_i(\tau) e^{\lambda_i \tau}$, where $P_i$ is a
polynomial and the $\lambda_i$ may be complex. Consequently, these functions are
holomorphic.

Using the identity theorem for holomorphic functions, $t$ can only be a limit
point of the set of zeros of $Pr_l^D(\tau) - Pr_l^{D'}(\tau)$ if $Pr_l^D(\tau)$
and $Pr_l^{D'}(\tau)$ are identical on an $\varepsilon$-environment of $t$. The
same applies to their derivatives: the zeros of
$\dot{Pr}_l^D(\tau) - \dot{Pr}_l^{D'}(\tau)$ either have no limit point in $t$,
or $\dot{Pr}_l^D(\tau)$ and $\dot{Pr}_l^{D'}(\tau)$ are identical on an
$\varepsilon$-environment of $t$.

For the remainder of the proof, we fix, for a given time $t$, a sufficiently
small $\varepsilon > 0$ such that, for each pair of schedulers $D$ and $D'$, the
following holds: for every location $l \in L_c$,
$Pr_l^D(\tau) - Pr_l^{D'}(\tau)$ is either $< 0$, $= 0$, or $> 0$ on the complete
interval $L_\varepsilon^t = (t - \varepsilon, t) \cap [0,t_{max}] \ni \tau$, and,
possibly with different sign, for the complete interval
$R_\varepsilon^t = (t, t + \varepsilon) \cap [0,t_{max}] \ni \tau$; and for every
location $l \in L_d$, $Pr_l^D(\tau) - Pr_l^{D'}(\tau)$ is either $< 0$, $= 0$, or
$> 0$ on the complete interval
$L_\varepsilon^t = (t - \varepsilon, t) \cap [0,t_{max}] \ni \tau$, and, possibly
with different sign, for the complete interval
$R_\varepsilon^t = (t, t + \varepsilon) \cap [0,t_{max}] \ni \tau$.
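
As a small worked illustration (not part of the original proof), assume a single
continuous non-goal location $l$ with two hypothetical actions $a$ and $b$ that
lead to an absorbing goal location with rates $\lambda_a > \lambda_b$, and a
support value $c = f_{max}(l,t) < 1$. Solving the linear differential equation
for the positional strategies $D$ (choosing $a$) and $D'$ (choosing $b$) gives:

```latex
\begin{align*}
Pr_l^{D}(\tau)  &= 1 - (1-c)\, e^{\lambda_a (\tau - t)}, \\
Pr_l^{D'}(\tau) &= 1 - (1-c)\, e^{\lambda_b (\tau - t)}, \\
Pr_l^{D}(\tau) - Pr_l^{D'}(\tau)
  &= (1-c)\,\bigl( e^{\lambda_b (\tau - t)} - e^{\lambda_a (\tau - t)} \bigr).
\end{align*}
```

The difference is a finite sum of exponentials whose only zero is $\tau = t$; it
is $> 0$ for all $\tau < t$ and $< 0$ for all $\tau > t$. In this toy case every
$\varepsilon > 0$ is sufficiently small, and the sign differs between the left
and the right $\varepsilon$-environment.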

We argue the case for the left $\varepsilon$-environment $L_\varepsilon^t$. In
the '>' case for a location $l$, we say that $D$ is $l$-better than $D'$. We call
$D$ preferable over $D'$ if $D'$ is not $l$-better than $D$ for any location $l$,
and better than $D'$ if $D$ is preferable over $D'$ and $l$-better for some
$l \in L$.

If $D'$ is $l$-better than $D$ in exactly a non-empty set $L_b \subseteq L$ of
locations, then we can obviously use $D'$ to construct a strategy $D''$ that is
better than $D$ by switching to the strategies of $D'$ in exactly the locations
$L_b$.

Since we choose our strategies from a finite domain (the deterministic
positional schedulers), this can happen only finitely many times. Hence we can
stepwise strictly improve a strategy, until we have constructed a strategy
$D_{max}$ that is preferable over all others.

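By the definition of being preferable over all other strategies, $D_{max}$
satisfies

\[ -\dot{Pr}_l^{D_{max}}(\tau) = \max_{a \in Act(l)} \sum_{l' \in L} R(l,a,l') \cdot \bigl( Pr_{l'}^{D_{max}}(\tau) - Pr_l^{D_{max}}(\tau) \bigr) \quad \text{for all } \tau \in L_\varepsilon^t,\ l \in L_c, \]

\[ Pr_l^{D_{max}}(\tau) = \max_{a \in Act(l)} \sum_{l' \in L} P(l,a,l') \cdot Pr_{l'}^{D_{max}}(\tau) \quad \text{for all } \tau \in L_\varepsilon^t,\ l \in L_d. \]

We can use the same method for the right $\varepsilon$-environment
$R_\varepsilon^t$, and pick the decision for $t$ arbitrarily; we use the decision
from the respective left $\varepsilon$-environment.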
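
The maximum on the right-hand side of the equation for continuous locations can
be evaluated pointwise. The following sketch is a simplified, first-order
stand-in for the comparison on whole intervals used in the proof: it only picks,
for a given vector of values, an action that maximises the right-hand side, and
it ignores ties that would have to be broken by higher derivatives. The name
`greedy_choice` and the dictionary layout are assumptions of this example.

```python
def greedy_choice(rates, values):
    """Pick, per continuous location, an action maximising the right-hand side.

    rates[l][a][l2] holds R(l, a, l2); values[l] is the current reachability
    value of l. Ties are resolved in favour of the alphabetically first action.
    """
    choice = {}
    for l, actions in rates.items():
        def gain(a):
            # sum_{l'} R(l,a,l') * (value(l') - value(l))
            return sum(r * (values[l2] - values[l])
                       for l2, r in actions[a].items())
        choice[l] = max(sorted(actions), key=gain)
    return choice


# Toy model: from "l", action "b" reaches the goal "g" three times as fast as "a".
toy_rates = {"l": {"a": {"g": 1.0}, "b": {"g": 3.0}}}
toy_values = {"l": 0.2, "g": 1.0}
print(greedy_choice(toy_rates, toy_values))  # {'l': 'b'}
```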

Now we have fixed, for an $\varepsilon$-environment of an arbitrary
$t \in [0,t_{max}]$, an optimal scheduler with at most one switching point. As
this is possible for all points in $[0,t_{max}]$, the sets
$I_\varepsilon^t = L_\varepsilon^t \cup \{t\} \cup R_\varepsilon^t$ define an open
cover of $[0,t_{max}]$. Using the compactness of $[0,t_{max}]$, we infer a finite
sub-cover, which establishes the existence of a strategy with a finite number of
switching points. □

Again, the proof for single player safety games runs accordingly. 

Theorem 9. For a single player continuous-time Markov game with only a 
safety player, there is an optimal deterministic scheduler with finitely many 
switching points. 

D From Late to Early Scheduling

Our main motivation for introducing discrete transitions is not the slightly im- 
proved generality of the model (nice though it is as a side result), but the intro- 
duction of a framework that covers both early schedulers (which have to fix an 
action when entering a location), and the late schedulers used in the paper. 

Late schedulers are naturally subsumed in our model, as the schedulers we
assume are the more powerful late schedulers. To embed early schedulers as well,
it suffices to use a simple translation: we 'split' every continuous location
$l_c$ into a fresh discrete location $l_d$, and one fresh continuous location
$l^a$ for each action $a \in Act(l_c)$ enabled in $l_c$.

Fig. 5. An informal example, depicting the idea of the encoding of an early
scheduling CTMG (left) in a late scheduling CTMG (right).

Every incoming transition to $l_c$ is re-routed to $l_d$, which has an outgoing
transition $a$ that surely leads to $l^a$ ($P(l_d,a,l^a) = 1$) for each action
$a \in Act(l_c)$ enabled in $l_c$, and no other outgoing transition. In $l^a$ we
have $Act(l^a) = \{a\}$, and the entries in $R(l^a,a,l)$ are the entries taken
from $R(l_c,a,l)$ for discrete locations $l$, and re-routed to the respective
$l'_d$ for continuous locations. Probability mass assigned to $l_c$ is moved to
$l_d$ by the translation, and if $l_c$ is a goal state, so are $l_d$ and the
$l^a$'s.

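Intuitively, every occurrence of a continuous location in a path is replaced by
its fresh discrete location followed by the fresh continuous location for the
chosen action; this includes the case where the continuous location is the
beginning of the path.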

Obviously, there is a trivial bijection between early schedulers for a thus 
translated CTMDP (or, indeed, CTMG), and the late schedulers in the mapping: 
the actions chosen in a discrete location are doubled, and there is no alternative 
to doubling it. 

As a consequence, the existence of finite deterministic optimal control extends 
to early scheduling. 
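
The translation can be sketched as a small data transformation. The
representation below (nested dictionaries for $R$ and $P$, string names for the
original locations, tuples for the fresh ones) and the function name
`split_for_early_scheduling` are assumptions made for this illustration only.

```python
def split_for_early_scheduling(cont, disc, goal, init):
    """Encode early scheduling by splitting every continuous location.

    cont[l][a][l2] = R(l, a, l2) for continuous l; disc[l][a][l2] = P(l, a, l2)
    for discrete l; goal is a set of locations; init[l] is the initial mass.
    Fresh locations are modelled as tuples: (l, "d") is the discrete copy of l
    and (l, "c", a) its continuous copy for action a (a hypothetical naming).
    """
    def entry(l):
        # transitions into a continuous location go to its discrete copy
        return (l, "d") if l in cont else l

    new_cont, new_disc = {}, {}
    for l, actions in disc.items():
        new_disc[l] = {a: {entry(l2): p for l2, p in succ.items()}
                       for a, succ in actions.items()}
    for l, actions in cont.items():
        # the discrete copy offers every enabled action and surely moves on
        new_disc[(l, "d")] = {a: {(l, "c", a): 1.0} for a in actions}
        for a, succ in actions.items():
            # the continuous copy for a has a as its only action
            new_cont[(l, "c", a)] = {a: {entry(l2): r for l2, r in succ.items()}}

    new_goal = set()
    for l in goal:
        if l in cont:
            new_goal.add((l, "d"))
            new_goal.update((l, "c", a) for a in cont[l])
        else:
            new_goal.add(l)
    new_init = {entry(l): p for l, p in init.items()}
    return new_cont, new_disc, new_goal, new_init


# Toy model: continuous location "c" with actions "a" and "b", goal "g".
toy_cont = {"c": {"a": {"c": 1.0, "g": 2.0}, "b": {"g": 3.0}},
            "g": {"a": {"g": 1.0}}}
new_cont, new_disc, new_goal, new_init = split_for_early_scheduling(
    toy_cont, {}, {"g"}, {"c": 1.0})
print(new_disc[("c", "d")])
```

In the result, the action is fixed in the discrete copy on entry, and the
continuous copy that follows offers no further choice.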

From Early to Late Scheduling. As a side remark, we would like to point
out that a similar translation can be used to reduce finding optimal control for
late schedulers to finding optimal control for early schedulers. Following up on
the remark that (assuming late scheduling) we can work with the uniformisation
of a CTMG when seeking co-optimal control, it suffices to establish such a
translation for uniform CTMGs, that is, for CTMGs where the transition rate
$R(l_c,a,L)$ is constant for all continuous locations $l_c \in L_c$ and all
their enabled actions $a \in Act(l_c)$. For such uniform CTMGs we can move the
decision into a fresh discrete location after the continuous location, just as
we moved it to a fresh discrete location before the continuous location in our
reduction from late to early scheduling.